1 Introduction

The Lempel-Ziv (LZ) factorization [20] of a string has played an important role in data compression for more than 40 years, and it is also the basis of important algorithms on strings, such as the linear-time detection of all maximal repetitions (runs) in a string [16]. Because of its importance in data compression, there is an extensive literature on algorithms that compute the LZ-factorization; [1,2,3,4, 10, 12, 13, 18, 19] is an incomplete list.

A variant of the LZ-factorization is the f-factorization, which played an important role in solving a long-standing open problem: it enabled the development of the first linear-time algorithm for the computation of seeds by Kociumaka et al. [15].

Definition 1

Let \(S=S[0..n-1]\) be a string of length n on an alphabet \(\varSigma \). The f-factorization \(s_1s_2\cdots s_m\) of S can be defined as follows. Given \(s_1s_2\cdots s_{j-1}\), the next factor \(s_j\) is obtained by a case distinction on the character \(c=S[i]\), where \(i = |s_1s_2\cdots s_{j-1}|\):

  (a) if c does not occur in \(s_1s_2\cdots s_{j-1}\) then \(s_j=c\);

  (b) else \(s_j\) is the longest prefix of \(S[i..n-1]\) that is a substring of \(s_1s_2\cdots s_{j-1}\).

The difference from the LZ-factorization is that the factors must be non-overlapping. There are two linear-time algorithms that compute the f-factorization [5, 6]. Both of them compute the \(\mathsf {LPnF}\)-array (defined below), from which the f-factorization can be derived (in case (b), the factor \(s_j\) is the length-\(\mathsf {LPnF}[i]\) prefix of \(S[i..n-1]\)).
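For concreteness, the case distinction of Definition 1 can be rendered as a naive quadratic-time function (our own illustration; the function name and structure are not taken from the cited papers):

```python
def f_factorization_naive(S):
    """Compute the f-factorization of S directly from Definition 1.

    Quadratic-time reference implementation for illustration only; the
    linear-time algorithms [5, 6] proceed via the LPnF-array instead.
    """
    factors = []
    i = 0
    while i < len(S):
        prev = S[:i]                       # s_1 s_2 ... s_{j-1}
        if S[i] not in prev:               # case (a): new character
            factors.append(S[i])
            i += 1
        else:                              # case (b): longest prefix of S[i..n-1]
            l = 1                          # that is a substring of s_1 ... s_{j-1}
            while i + l < len(S) and S[i:i + l + 1] in prev:
                l += 1
            factors.append(S[i:i + l])
            i += l
    return factors
```

For example, `f_factorization_naive("abaab")` yields the factors `['a', 'b', 'a', 'ab']`: the final factor is `ab` because it occurs in the non-overlapping prefix `aba`.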

Fig. 1. The \(\mathsf {LPnF}\), \(\mathsf {LPF}\), and \(\mathsf {prevOcc}\) arrays of the string \(S=aaaaaaaaaaaaaaaa\).

Definition 2

For a string S of length n, the longest previous non-overlapping factor \((\mathsf {LPnF})\) array of size n is defined for \(0\le i < n\) by

$$ \mathsf {LPnF}[i] = \max \{\ell \mid 0\le \ell \le n-i; S[i..i+\ell -1] \text{ is a substring of } S[0..i-1]\} $$

In the following, we will give a simple algorithm that directly bases the computation of the \(\mathsf {LPnF}\)-array on the \(\mathsf {LPF}\)-array, which is used in several algorithms that compute the LZ-factorization. The \(\mathsf {LPF}\)-array is defined for \(0\le i < n\) by

$$ \mathsf {LPF}[i] = \max \{\ell \mid 0\le \ell \le n-i; S[i..i+\ell -1] \text{ is a substring of } S[0..i+\ell -2]\} $$

In data compression, we are not only interested in the length of the longest previous factor but also in a previous position at which it occurred (because otherwise decompression would be impossible). For an \(\mathsf {LPF}\)-array, the positions of previous occurrences are stored in an array \(\mathsf {prevOcc}\). If \(\mathsf {LPF}[i] = 0\), we set \(\mathsf {prevOcc}[i] = \bot \) (for decompression, one can use the definition \(\mathsf {prevOcc}[i] = S[i]\)). Figure 1 depicts the \(\mathsf {LPF}\)-array of \(S=a^{16}\) and two of many possible instances of the \(\mathsf {prevOcc}\)-array: one that stores the rightmost (rm) positions of occurrences of longest previous factors and one that stores the leftmost (lm) positions.
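The two definitions can be checked against a brute-force computation (our own quadratic illustration, not one of the cited algorithms):

```python
def lpf_lpnf_naive(S):
    """Brute-force LPF and LPnF arrays, directly from the definitions.

    LPF[i]:  longest prefix of S[i..n-1] occurring in S[0..i+l-2] (overlap allowed).
    LPnF[i]: longest prefix of S[i..n-1] occurring in S[0..i-1] (no overlap).
    """
    n = len(S)
    LPF, LPnF = [0] * n, [0] * n
    for i in range(n):
        for l in range(1, n - i + 1):
            if S[i:i + l] in S[0:i + l - 1]:   # previous occurrence may overlap i
                LPF[i] = l
            if S[i:i + l] in S[0:i]:           # previous occurrence ends before i
                LPnF[i] = l
    return LPF, LPnF
```

For \(S=a^{16}\) this reproduces Fig. 1: \(\mathsf {LPF}[i] = 16-i\) and \(\mathsf {LPnF}[i] = \min \{i, 16-i\}\) for \(i>0\).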

2 Computing \(\mathsf {LPnF}\) from \(\mathsf {LPF}\)

Algorithm 1 computes the \(\mathsf {LPnF}\)-array by a right-to-left scan of the \(\mathsf {LPF}\)-array and its \(\mathsf {prevOcc}\)-array. The computation of an entry \(\ell =\mathsf {LPnF}[i]\) is solely based on entries \(\mathsf {LPF}[j]\) and \(\mathsf {prevOcc}[j]\) with \(j\le i\). Consequently, after the calculation of \(\ell \), it can be stored in \(\mathsf {LPF}[i]\). Since Algorithm 1 overwrites the \(\mathsf {LPF}\)-array with the \(\mathsf {LPnF}\)-array (and the \(\mathsf {prevOcc}\)-array of \(\mathsf {LPF}\) with the \(\mathsf {prevOcc}\)-array of \(\mathsf {LPnF}\)), no extra space is needed. Algorithm 1 is based on the following simple idea:

  1. If the factor starting at position i and its previous occurrence starting at position \(j=\mathsf {prevOcc}[i]\) do not overlap, then clearly \(\mathsf {LPnF}[i] = \mathsf {LPF}[i]\).

  2. Otherwise, the length of the (currently best) previous non-overlapping factor is \(\ell =i-j\). A longer previous non-overlapping factor exists if \(\mathsf {LPF}[j] > \ell \) (note that \(\mathsf {LPF}[i] > \ell \) holds): the prefix of \(S[i..n-1]\) of length \(\min \{\mathsf {LPF}[i],\mathsf {LPF}[j]\}\) also occurs at position \(\mathsf {prevOcc}[j]\), and even if the two occurrences (starting at i and \(\mathsf {prevOcc}[j]\)) overlap, their non-overlapping part must be greater than \(\ell \) because \(\mathsf {prevOcc}[j]<j\).

  3. Step 2 is repeated until there is no further candidate (condition of the while-loop in line 7) or the two occurrences under consideration do not overlap (line 11 of Algorithm 1).

Algorithm 1 (listing omitted)
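The three steps above can be sketched in Python as follows. This is our reconstruction from the description, so its line numbering does not match the references to lines 7, 11, and 12 of the original listing; `None` plays the role of \(\bot \):

```python
def lpnf_from_lpf(LPF, prevOcc):
    """Overwrite LPF with LPnF (and prevOcc accordingly), in place.

    Reconstruction of Algorithm 1 from the three steps in the text.
    """
    n = len(LPF)
    for i in range(n - 1, -1, -1):          # right-to-left scan
        j = prevOcc[i]
        if LPF[i] > 0 and j + LPF[i] > i:   # step 1 fails: occurrences overlap
            ell = i - j                     # step 2: currently best length
            while LPF[j] > ell:             # a longer candidate may exist
                k, m = prevOcc[j], min(LPF[i], LPF[j])
                if k + m <= i:              # candidates do not overlap: done
                    ell, j = m, k
                    break
                ell, j = i - k, k           # overlap: improved candidate, repeat
            LPF[i], prevOcc[i] = ell, j     # entries right of i are never read again
    return LPF, prevOcc
```

With the rightmost \(\mathsf {prevOcc}\)-array of \(S=aaaa\), i.e., `LPF = [0, 3, 2, 1]` and `prevOcc = [None, 0, 1, 2]`, the call turns `LPF` into the \(\mathsf {LPnF}\)-array `[0, 1, 2, 1]`.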

On the one hand, the example in Fig. 1 shows that Algorithm 1 may have a quadratic run-time if it uses the \(\mathsf {prevOcc}\)-array that stores the rightmost positions of previous occurrences. On the other hand, the next lemma proves that Algorithm 1 has a linear worst-case time complexity if it uses the \(\mathsf {prevOcc}\)-array that stores the leftmost positions of previous occurrences. Its proof is based on the following notion: An integer p with \(0<p\le |\omega |\) is called a period of \(\omega \in \varSigma ^+\) if \(\omega [i]=\omega [i+p]\) for all \(i\in \{0,1,\dots ,|\omega |-p-1\}\).

Lemma 1

If \(\mathsf {prevOcc}\) stores the leftmost positions of previous occurrences, then the else-case on line 12 in Algorithm 1 cannot occur.

Proof

For a proof by contradiction, suppose that the else-case on line 12 in Algorithm 1 occurs for some i. We have \(\mathsf {LPF}[i]>0\), \(j=\mathsf {prevOcc}[i]\) is the leftmost occurrence of the longest previous factor \(\omega _i\) starting at i, and \(j+ \mathsf {LPF}[i] > i\). Suppose \(\mathsf {LPF}[j] > i-j\), i.e., the while-loop is executed. Let \(m = \min \{\mathsf {LPF}[i],\mathsf {LPF}[j]\}\) and \(k=\mathsf {prevOcc}[j]\). If \(m = \mathsf {LPF}[i]\), then it would follow that an occurrence of \(\omega _i\) starts at k. This contradicts the fact that j is the leftmost occurrence of \(\omega _i\). Consequently, \(m = \mathsf {LPF}[j] < \mathsf {LPF}[i]\). The else-case on line 12 occurs when \(k+ m > i\). This implies \(k+ m > j\) because \(i > j\). Let \(\omega _j\) be the longest previous factor starting at j. Let \(a=S[k+m]\) (the character following the occurrence of \(\omega _j\) starting at k) and \(b=S[j+m]\) (the character following the occurrence of \(\omega _j\) starting at j); see Fig. 2. By definition, \(a\ne b\). We will derive a contradiction by showing that \(a=b\) must hold in the else-case on line 12.

Since \(k+ m > j\), the occurrence of \(\omega _j\) starting at k overlaps with the occurrence of \(\omega _j\) starting at j. Let u be the non-overlapping part of the occurrence of \(\omega _j\) starting at k, i.e., \(u = S[k..j-1]\). Because the occurrence of \(\omega _j\) starting at j has u as a prefix and overlaps with the occurrence of \(\omega _j\) starting at k, it follows that |u| is a period of \(\omega _j\); see Fig. 2. By a similar reasoning, one can see that |v| is a period of \(\omega _i\), where \(v = S[j..i-1]\). Since \(\omega _j\) is a length-m prefix of \(S[j..n-1]\) and \(\omega _i\) is a length-\(\mathsf {LPF}[i]\) prefix of \(S[j..n-1]\), where \(m = \mathsf {LPF}[j] < \mathsf {LPF}[i]\), it follows that \(\omega _j\) is a prefix of \(\omega _i\). Hence |v| is also a period of \(\omega _j\). In summary, both |u| and |v| are periods of \(\omega _j\). Fine and Wilf’s theorem [8] states that if \(|\omega _j| \ge |u| + |v| - \gcd (|u|,|v|)\), then the greatest common divisor \(\gcd (|u|,|v|)\) of |u| and |v| is also a period of \(\omega _j\). Since \(k+m > i\) implies \(m = |\omega _j| > (i-j) + (j-k) = |v| + |u|\), the theorem is applicable. Let \(\gamma \) be the length-\(\gcd (|u|,|v|)\) prefix of \(\omega _j\). It follows that \(v=\gamma ^q\) for some integer \(q>0\); hence \(|\gamma |\) is a period of \(\omega _i\). Recall that \(a=S[k+m]\) is the character \(\omega _j[m-|u|] = \omega _i[m-|u|]\) and \(b=S[j+m] = \omega _i[m]\). We derive \(a = \omega _i[m-|u|] = \omega _i[m] = b\) because \(|\gamma |\) is a period of \(\omega _i\) and |u| is a multiple of \(|\gamma |\). This contradiction proves the lemma.

Fig. 2. Proof of Lemma 1: i, j, and k are positions, while a and b are characters.

To the best of our knowledge, Abouelhoda et al. [1] were the first to compute the LZ-factorization based on the suffix array (and the \(\mathsf {LCP}\)-array) of S. Their algorithm computes the \(\mathsf {LPF}\)-array and the \(\mathsf {prevOcc}\)-array that stores leftmost positions of previous occurrences of longest factors. So the combination of their algorithm and Algorithm 1 gives a linear-time algorithm that computes the \(\mathsf {LPnF}\)-array. Subsequent work (e.g. [2,3,4, 12, 13, 18]) concentrated on LZ-factorization algorithms that are faster in practice or more space-efficient (or both). Some of them also first compute the arrays \(\mathsf {LPF}\) and \(\mathsf {prevOcc}\), but their \(\mathsf {prevOcc}\)-arrays store neither leftmost nor rightmost occurrences (in fact, these algorithms are faster because they use lexicographically nearby suffixes, a local property, whereas being the leftmost occurrence is a global property). However, leftmost occurrences can easily be obtained by Algorithm 2, which is based on the following simple observation: if \(\mathsf {LPF}[i]>0\), \(j=\mathsf {prevOcc}[i]\), and \(\mathsf {LPF}[j]\ge \mathsf {LPF}[i]\), then \(\mathsf {prevOcc}[j]\) is also the starting position of an occurrence of the factor starting at i. Since \(\mathsf {prevOcc}[j] < j\), an occurrence left of j has been found. The while-loop in Algorithm 2 repeats this procedure until the leftmost occurrence is found. Note that the algorithm overwrites the \(\mathsf {prevOcc}\)-array. Consequently, if its for-loop is executed for i, then for every \(0\le j < i\), \(\mathsf {prevOcc}[j]\) stores a leftmost position.

The next example shows that Algorithm 2 is not linear in the worst case. Consider the string \(S=a^1\#_1a^2\#_2a^3\#_3a^4\#_4\dots a^m\#_m\), where \(m > 0\) and the \(\#_k\) are pairwise distinct separator symbols. Clearly, the length of S is \(n = m + \sum ^m_{k=1} k = m + m(m+1)/2 = m(m+3)/2\). If Algorithm 2 is applied to the arrays \(\mathsf {LPF}\) and \(\mathsf {prevOcc}\) in Fig. 3, it computes the leftmost (lm) \(\mathsf {prevOcc}\)-array, and the number of iterations of its while-loop (last row in Fig. 3) is \(\sum ^{m-1}_{j=1} \sum ^j_{k=1} k = (\sum ^{m-1}_{j=1} j^2 + \sum ^{m-1}_{j=1} j)/2 = (m-1)m(m+1)/6\).

Algorithm 2 (listing omitted)
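The observation above translates into the following sketch (our reconstruction of Algorithm 2, not the original listing). Because \(\mathsf {prevOcc}\) is rewritten in place and i increases, every \(\mathsf {prevOcc}[j]\) read inside the while-loop already holds a leftmost position:

```python
def leftmost_prevocc(LPF, prevOcc):
    """Rewrite prevOcc in place so that it stores leftmost previous occurrences."""
    for i in range(len(LPF)):
        if LPF[i] > 0:
            j = prevOcc[i]
            while LPF[j] >= LPF[i]:   # the factor at i also occurs at prevOcc[j],
                j = prevOcc[j]        # which lies strictly left of j
            prevOcc[i] = j            # LPF[j] < LPF[i]: no occurrence left of j
    return prevOcc
```

On the rightmost \(\mathsf {prevOcc}\)-array of \(S=aaaa\), `leftmost_prevocc([0, 3, 2, 1], [None, 0, 1, 2])` returns `[None, 0, 0, 0]`.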
Fig. 3. \(\mathsf {LPF}\) and \(\mathsf {prevOcc}\) arrays of the string \(S=a^1\#_1a^2\#_2a^3\#_3a^4\#_4\).

3 Direct Computation of the f-Factorization

Algorithm 3 computes the f-factorization of S based on backward search on \(T=S^{rev}\) and range maximum queries (\(\mathsf {RMQ}\)s) on the suffix array of T. It uses ideas of [2, Algorithm CPS2] and [18, Algorithm LZ_bwd]. In fact, Algorithm 3 computes the right-to-left f-factorization of the reverse string \(S^{rev}\) of S. It is not difficult to see that \(s_1s_2\cdots s_m\) is the (left-to-right) f-factorization of S if and only if \(s^{rev}_m\cdots s^{rev}_2 s^{rev}_1\) is the right-to-left f-factorization of \(S^{rev}\). In this section, we assume a basic knowledge of suffix arrays (\(\mathsf {SA}\)), the Burrows-Wheeler transform (\(\mathsf {BWT}\)), and wavelet trees; see e.g. [7, 18]. Given a substring \(\omega \) of T, there is a suffix array interval [sp..ep], called the \(\omega \)-interval, such that \(\omega \) is a prefix of the suffix \(T[\mathsf {SA}[k]..n]\) if and only if \(sp\le k \le ep\). For a character c, the \(c\omega \)-interval can be computed by one backward search step backwardSearch(c, [sp..ep]); this takes \(O(\log |\varSigma |)\) time if backward search is based on the wavelet tree of the \(\mathsf {BWT}\) of T. A linear-time preprocessing is sufficient to obtain a space-efficient data structure that supports \(\mathsf {RMQ}\)s in constant time; see [9] and the references therein. \(\mathsf {RMQ}(sp,ep)\) returns the index of the maximum value among \(\mathsf {SA}[sp],\mathsf {SA}[sp+1],\dots ,\mathsf {SA}[ep]\); hence \(\mathsf {SA}[\mathsf {RMQ}(sp,ep)]\) is the maximum of these \(\mathsf {SA}\)-values.

Suppose Algorithm 3 has already computed \(s^{rev}_{j-1}\cdots s^{rev}_2 s^{rev}_1\) and let \(i=n - (|s^{rev}_{j-1}\cdots s^{rev}_2 s^{rev}_1|+1)\). It computes the next factor \(s^{rev}_j\) as follows. First, it stores the starting position i in a variable pos. In line 6, backwardSearch(T[i], [0..n]) returns the c-interval [sp..ep], where \(c=T[i]\). In line 7, the maximum max of \(\mathsf {SA}[sp],\mathsf {SA}[sp+1],\dots ,\mathsf {SA}[ep]\) is determined. If \(max = pos\) (\(max < pos\) is impossible because \(c=T[pos]\)), then there is no occurrence of c in \(T[pos+1..n]\), so that \(s^{rev}_j = c\) (the algorithm outputs 0, meaning that the next factor is the next character). Otherwise, there is a previous occurrence of c at position \(max > pos\) and the process is iterated, i.e., i is decremented by one and the new T[i..pos]-interval is computed, etc. Consider an iteration of the repeat-loop, where [sp..ep] is the T[i..pos]-interval for some \(i<pos\). The repeat-loop must be terminated early (line 9) if \(max \le pos\) because then the rightmost occurrence of T[i..pos] starts left of \(pos+1\); in other words, T[i..pos] is not a substring of \(T[pos+1..n]\). Since the repeat-loop did not terminate in the previous iteration, \(T[i+1..pos]\) is a substring of \(T[pos+1..n]\) that has a previous occurrence at position \(m > pos\), where m is the maximum \(\mathsf {SA}\)-value of the previous iteration. So \(s^{rev}_j = T[i+1..pos]\) and the algorithm outputs its length \(|s^{rev}_j| = pos - (i+1) +1 = pos - i\), which coincides with \(|s_j|\). Note that the algorithm can easily be extended so that it also computes positions of previous occurrences. Algorithm 3 has run-time \(O(n\log |\varSigma |)\) because one backward search step takes \(O(\log |\varSigma |)\) time.
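The control flow just described can be sketched as follows. In this sketch (our own, under simplifying assumptions), a plainly sorted suffix array with binary search and a linear maximum over the interval stand in for the wavelet-tree backward search and the constant-time RMQ structure, so the sketch does not achieve the \(O(n\log |\varSigma |)\) bound:

```python
def f_factor_lengths(S):
    """Factor lengths of the f-factorization of S (0 = single new character),
    obtained via the right-to-left factorization of T = reverse(S)."""
    T = S[::-1]
    n = len(T)
    SA = sorted(range(n), key=lambda k: T[k:])   # naive suffix array of T

    def sa_interval(pat):
        # [sp..ep] of suffixes having pat as a prefix (binary search on SA)
        lo, hi = 0, n
        while lo < hi:                           # leftmost suffix >= pat
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(pat)] < pat:
                lo = mid + 1
            else:
                hi = mid
        sp, hi = lo, n
        while lo < hi:                           # leftmost suffix > pat
            mid = (lo + hi) // 2
            if T[SA[mid]:SA[mid] + len(pat)] <= pat:
                lo = mid + 1
            else:
                hi = mid
        return sp, lo - 1

    out, pos = [], n - 1
    while pos >= 0:                  # the next factor ends at position pos of T
        i = pos
        while True:
            sp, ep = sa_interval(T[i:pos + 1])
            mx = max(SA[sp:ep + 1])  # stands in for SA[RMQ(sp, ep)]
            if mx <= pos:            # T[i..pos] does not occur right of pos
                break
            if i == 0:               # matched all the way to the start of T
                i = -1
                break
            i -= 1                   # one more backward search step
        length = pos - i             # the factor is T[i+1..pos]; 0 = fresh char
        out.append(length)
        pos -= max(1, length)
    return out
```

For \(S=abab\) the output is `[0, 0, 2]`, matching the f-factorization a, b, ab.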

Algorithm 3 (listing omitted)

Kolpakov and Kucherov [17] used the reversed f-factorization (they call it the reversed LZ-factorization) to search for gapped palindromes. The reversed f-factorization is defined by replacing case (b) in Definition 1 with: (b) else \(s_j\) is the longest prefix of \(S[i..n-1]\) that is a substring of \((s_1s_2\cdots s_{j-1})^{rev}\). It is not difficult to see that Algorithm 3 can be modified in such a way that it computes the reversed f-factorization of S in \(O(n \log |\varSigma |)\) time (to find the next factor \(s_j\), match prefixes of \(S[i..n-1]\) against \(T=S^{rev}\)).
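The modified case (b) can again be rendered naively (our quadratic illustration, matching prefixes against the reverse of the already-factorized prefix):

```python
def reversed_f_factorization(S):
    """Reversed f-factorization: case (b) matches against (s_1 ... s_{j-1})^rev."""
    factors, i = [], 0
    while i < len(S):
        prev_rev = S[:i][::-1]             # (s_1 s_2 ... s_{j-1})^rev
        if S[i] not in prev_rev:           # case (a): new character
            factors.append(S[i])
            i += 1
        else:                              # case (b): longest prefix of S[i..n-1]
            l = 1                          # occurring in the reversed prefix
            while i + l < len(S) and S[i:i + l + 1] in prev_rev:
                l += 1
            factors.append(S[i:i + l])
            i += l
    return factors
```

For example, `reversed_f_factorization("abbab")` yields `['a', 'b', 'ba', 'b']`: the factor `ba` is a substring of \((ab)^{rev} = ba\).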

4 Experimental Results

Our implementation is based on the sdsl-lite library [11], and we experimentally compared it with the \(\mathsf {LPnF}\) construction algorithm of Crochemore and Tischler [6], called the \(\texttt {CT}\)-algorithm henceforth. Another \(\mathsf {LPnF}\) construction algorithm is described in [5], but we could not find an implementation (this algorithm is most likely slower than the \(\texttt {CT}\)-algorithm because it uses two kinds of range minimum queries, one on the suffix array and one on the \(\mathsf {LCP}\)-array, and range minimum queries are slow; see below). The experiments were conducted on a 64-bit Ubuntu 16.04.4 LTS system equipped with two 16-core Intel Xeon E5-2698v3 processors and 256 GB of RAM. All programs were compiled with the -O3 option using g++ (version 5.4.1). Our programs are publicly available. The test data (the files dblp.xml, dna, english, and proteins) originate from the Pizza & Chili corpus. In our first experiment, we computed the \(\mathsf {LPnF}\)-array from the \(\mathsf {LPF}\)-array. Three algorithms that compute the \(\mathsf {LPF}\)-array were considered:

  • AKO: algorithm by Abouelhoda et al. [1]

  • LZ_OG: algorithm by Ohlebusch and Gog [18]

  • KKP3: algorithm by Kärkkäinen et al. [14]

It is known that AKO is slower than the others, but in contrast to the other algorithms it calculates leftmost \(\mathsf {prevOcc}\)-arrays. Thus, there was a slight chance that AKO in combination with Algorithm 1 would be faster than LZ_OG or KKP3 in combination with Algorithm 1. However, our experiments showed that this is not the case. AKO is missing in Fig. 4 because the differences between the run-times of the other algorithms become more apparent without it. For the same reason, we did not take the suffix array construction time into account (note that each of the algorithms needs the suffix array). To find out whether or not it is advantageous to compute a leftmost \(\mathsf {prevOcc}\)-array by Algorithm 2 before Algorithm 1 is applied, we also considered the combinations of LZ_OG and KKP3 with both algorithms. Figures 4 and 5 show the results of the first experiment. As one can see in Fig. 4, for real-world data it seems disadvantageous to apply Algorithm 2 before Algorithm 1 because the overall run-time becomes slightly worse. However, for ‘problematic’ strings such as \(a^n\) and \(a^nb\) it is advisable to use Algorithm 2: with it, both LZ_OG and KKP3 outperformed the CT-algorithm (data not shown), but without it, neither terminated within 20 minutes. All in all, KKP3 in combination with Algorithms 1 and 2 is the best choice for the construction of the \(\mathsf {LPnF}\)-array. In particular, it clearly outperforms the CT-algorithm in terms of run-time and memory usage.

Fig. 4. Run-time comparison of \(\mathsf {LPnF}\)-array construction (without suffix array construction, which on average takes 50% of the overall run-time).

Fig. 5. Peak memory comparison of \(\mathsf {LPnF}\)-array construction (with suffix array construction).

In the second experiment, we compared Algorithm 3 (the only algorithm that computes the f-factorization directly) with the other algorithms (which first compute the \(\mathsf {LPnF}\)-array and then derive the f-factorization from it). Algorithm 3 uses only 44% of the memory required by KKP3, but its run-time is an order of magnitude worse (data not shown). We attribute the rather poor run-time to the range maximum queries, which are slow in practice.