1 Introduction

The suffix array contains the indices of all suffixes of a string arranged in lexicographical order. It is arguably one of the most important data structures in stringology, the field of algorithms on strings and sequences. It was introduced in 1990 by Manber and Myers for on-line string searches [9] and has since been adopted in a wide range of applications including text indexing and compression [12]. Although the suffix array is conceptually very simple, constructing it efficiently is not a trivial task.

When n is the length of the input text, the suffix array can be constructed in \(\mathcal {O}(n)\) time and \(\mathcal {O}(1)\) additional words of working space when the alphabet is linearly-sortable (i.e. the symbols in the string can be sorted in \(\mathcal {O}(n)\) time) [7, 8, 10]. However, algorithms with these bounds are not always the fastest in practice. For instance, DivSufSort has been the fastest suffix array construction algorithm (SACA) for over a decade despite having super-linear worst-case time complexity [3, 5]. To the best of our knowledge, the currently fastest suffix sorter is libsais, which appeared as source code on GitHub in February 2021 and has not been subject to peer review in any academic context. The author claims that libsais is an improved implementation of the SA-IS algorithm and hence has linear time complexity [11].

GSACA, the only non-recursive linear-time suffix sorting algorithm, was introduced in 2015 by Baier; it is competitive neither in terms of speed nor in memory consumption [1, 2]. Despite the algorithm’s entirely novel approach and interesting theoretical properties [6], there has been little effort to optimise it. In 2021, Bertram et al. [3] provided DSH, a faster SACA that uses the same sorting principle as GSACA. Their algorithm beats DivSufSort in terms of speed, but also has super-linear time complexity.

Our Contributions. We provide a linear-time SACA that relies on the same grouping principle employed by DSH and GSACA, but is faster than both. This is achieved by exploiting certain properties of Lyndon words that are not used in the other algorithms. As a result, our algorithm is more than \(11\%\) faster than DSH on real-world texts and at least \(46\%\) faster than Baier’s GSACA implementation. Although our algorithm is not on par with libsais on real-world data, it significantly improves Baier’s sorting principle and answers in the affirmative the question, posed in [4], of whether the precomputed Lyndon array can be used to accelerate GSACA.

The rest of this paper is structured as follows: Sect. 2 introduces the definitions and notations used throughout this paper. In Sect. 3, the grouping principle is investigated and a description of our algorithm is provided. Finally, in Sect. 4 our algorithm is evaluated experimentally and compared to other relevant SACAs.

This is an abridged version of a longer paper available on arXiv [13].

2 Preliminaries

For \(i,j\in \mathbb {N}_0\) we denote the set \(\left\{ k\in \mathbb {N}_0 : i \le k \le j\right\} \) by the interval notations \(\left[ i \, ..\,j\right] = \left[ i \, ..\,j+1\right) = \left( i-1 \, ..\,j\right] = \left( i-1 \, ..\,j+1\right) \). For an array A we analogously denote the subarray from i to j by \(A\left[ i \, ..\,j\right] = A\left[ i \, ..\,j+1\right) = A\left( i-1 \, ..\,j\right] = A\left( i-1 \, ..\,j+1\right) = A[i]A[i+1]\dots A[j]\). We use zero-based indexing, i.e. the first entry of the array A is A[0]. A string S of length n over an alphabet \(\varSigma \) is a sequence of n characters from \(\varSigma \). We denote the length n of S by \(\left| S\right| \) and the i’th symbol of S by \(S[i-1]\), i.e. strings are zero-indexed. Analogous to arrays we denote the substring from i to j by \(S\left[ i \, ..\,j\right] = S\left[ i \, ..\,j+1\right) = S\left( i-1 \, ..\,j\right] = S\left( i-1 \, ..\,j+1\right) = S[i]S[i+1]\dots S[j]\). For \(j < i\) we let \(S\left[ i \, ..\,j\right] \) be the empty string \(\varepsilon \). The suffix i of a string S of length n is the substring \(S\left[ i \, ..\,n\right) \) and is denoted by \(S_i\). Similarly, the substring \(S\left[ 0 \, ..\,i\right] \) is a prefix of S. A suffix (prefix) is proper if \(i>0\) (\(i + 1 < n\)). For two strings u and v and an integer \(k\ge 0\) we let uv be the concatenation of u and v and denote the k-times concatenation of u by \(u^k\). We assume totally ordered alphabets. This induces a total order on strings. Specifically, we say a string S of length n is lexicographically smaller than another string \(S'\) of length m if and only if there is some \(\ell \le \min \left\{ n,m\right\} \) such that \(S\left[ 0 \, ..\,\ell \right) = S'\left[ 0 \, ..\,\ell \right) \) and either \(n=\ell <m\) or \(S[\ell ] < S'[\ell ]\). If S is lexicographically smaller than \(S'\) we write \(S <_{lex} S'\).

A non-empty string S is a Lyndon word if and only if S is lexicographically smaller than all its proper suffixes [14]. The Lyndon prefix of S is the longest prefix of S that is a Lyndon word. We let \(\mathcal {L}_i\) denote the Lyndon prefix of \(S_i\).
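To make these notions concrete, the following C++ sketch checks the Lyndon property and extracts the Lyndon prefix of a suffix directly from the definitions (the function names are ours and purely illustrative; efficient alternatives exist, cf. Lemma 4 below):

```cpp
#include <cstddef>
#include <string>

// Definition-based checks (illustrative only, far from efficient).
// A non-empty string is a Lyndon word iff it is lexicographically smaller
// than all of its proper suffixes.
bool is_lyndon(const std::string& w) {
    if (w.empty()) return false;
    for (std::size_t i = 1; i < w.size(); ++i)
        if (w.compare(w.substr(i)) >= 0) return false;  // w >= one of its proper suffixes
    return true;
}

// Lyndon prefix L_i of the suffix S_i: the longest prefix of S_i that is a Lyndon word.
std::string lyndon_prefix(const std::string& S, std::size_t i) {
    const std::string suffix = S.substr(i);
    for (std::size_t len = suffix.size(); len > 0; --len)
        if (is_lyndon(suffix.substr(0, len))) return suffix.substr(0, len);
    return "";  // only reached for the empty suffix: a single symbol is always Lyndon
}
// e.g. lyndon_prefix("acedcebceece$", 7) == "cee"
```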

In the remainder of this paper, we assume an arbitrary but fixed string S of length \(n>1\) over a totally ordered alphabet \(\varSigma \) with \(\left| \varSigma \right| \in \mathcal {O}(n)\). Furthermore, we assume w.l.o.g. that S is null-terminated, that is \(S[n-1] = \$\) and \(S[i] > \$\) for all \(i\in \left[ 0 \, ..\,n-1\right) \).

The suffix array \(\texttt {SA}\) of S is an array of length n that contains the indices of the suffixes of S in increasing lexicographical order. That is, \(\texttt {SA}\) forms a permutation of \(\left[ 0 \, ..\,n\right) \) and \(S_{\texttt {SA}[0]} <_{lex} S_{\texttt {SA}[1]} <_{lex} \dots <_{lex} S_{\texttt {SA}[n-1]}\).
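As an executable restatement of this definition (ours, not one of the algorithms discussed in this paper), the suffix array can be obtained naively by comparison-based sorting of the suffix indices:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array construction: sort the suffix indices by direct suffix
// comparison. Only meant to restate the definition of SA; O(n^2 log n) worst case.
std::vector<int64_t> naive_suffix_array(const std::string& S) {
    std::vector<int64_t> SA(S.size());
    std::iota(SA.begin(), SA.end(), int64_t{0});  // SA = 0, 1, ..., n-1
    std::sort(SA.begin(), SA.end(), [&](int64_t i, int64_t j) {
        return S.compare(i, std::string::npos, S, j, std::string::npos) < 0;
    });
    return SA;
}
// For S = "acedcebceece$" this yields the order shown in Fig. 1
// ('$' is smaller than all letters in ASCII).
```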

Definition 1

(\(\texttt {pss}\)-tree [4]). Let \(\texttt {pss}\) be the array such that \(\texttt {pss}[i]\) is the index of the previous smaller suffix for each \(i\in \left[ 0 \, ..\,n\right) \) (or -1 if none exists). Formally, \(\texttt {pss}[i] := \max {\left( \left\{ j\in \left[ 0 \, ..\,i\right) : S_j <_{lex} S_i\right\} \cup \left\{ -1\right\} \right) }\). Note that \(\texttt {pss}\) forms a tree with -1 as the root, in which each \(i\in \left[ -1 \, ..\,n\right) \) is represented by a node and \(\texttt {pss}[i]\) is the parent of node i. We call this tree the \(\texttt {pss}\)-tree. Further, we impose an order on the nodes that corresponds to the order of the indices represented by the nodes. In particular, if \(c_1<c_2<\dots <c_k\) are the children of i (i.e. \(\texttt {pss}[c_1]=\dots =\texttt {pss}[c_k]=i\)), we say \(c_k\) is the last child of i.

Fig. 1.

Shown are the Lyndon prefixes of all suffixes of \(S=\texttt {acedcebceece\$}{}\) and the corresponding suffix array, \(\texttt {nss}\)-array, \(\texttt {pss}\)-array and \(\texttt {pss}\)-tree. Each box indicates a Lyndon prefix. For instance, the Lyndon prefix of \(S_7=\texttt {ceece\$}\) is \(\mathcal {L}_7 = \texttt {cee}\). Note that \(\mathcal {L}_i\) is exactly S[i] concatenated with the Lyndon prefixes of i’s children in the \(\texttt {pss}\)-tree (see Lemma 4), e.g. \(\mathcal {L}_6 = S[6]\mathcal {L}_7\mathcal {L}_{10} = \texttt {bceece}\).

Analogous to \(\texttt {pss}[i]\), we define \(\texttt {nss}[i] := \min {\left\{ j\in \left( i \, ..\,n\right] : S_j <_{lex} S_i\right\} }\) as the next smaller suffix of i. Note that \(S_n=\varepsilon \) is smaller than any non-empty suffix of S, hence \(\texttt {nss}\) is well-defined.
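Both arrays can be computed in one left-to-right pass with a stack that always holds a chain of \(\texttt {pss}\)-ancestors. The following sketch (ours) uses plain suffix comparisons and therefore has quadratic worst-case time; our implementation instead uses the algorithm by Bille et al. [4] mentioned in Sect. 3.2:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Computes pss and nss in one pass. The stack holds indices whose next smaller
// suffix has not been seen yet; they form a chain of pss-ancestors. The direct
// suffix comparisons make this sketch quadratic in the worst case.
void compute_pss_nss(const std::string& S,
                     std::vector<int64_t>& pss, std::vector<int64_t>& nss) {
    const int64_t n = (int64_t)S.size();
    pss.assign(n, -1);
    nss.assign(n, n);  // S_n is the empty suffix, smaller than every non-empty suffix
    std::vector<int64_t> stack;
    for (int64_t i = 0; i < n; ++i) {
        // pop all suffixes that are lexicographically larger than S_i ...
        while (!stack.empty() &&
               S.compare(stack.back(), std::string::npos, S, i, std::string::npos) > 0) {
            nss[stack.back()] = i;  // ... S_i is their next smaller suffix
            stack.pop_back();
        }
        pss[i] = stack.empty() ? -1 : stack.back();  // previous smaller suffix of i
        stack.push_back(i);
    }
}
```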

In the rest of this paper, we use \(S=\texttt {acedcebceece\$}{}\) as our running example. Figure 1 shows its Lyndon prefixes and the corresponding \(\texttt {pss}\)-tree.

Definition 2

Let \(\mathcal {P}_i\) be the set of suffixes with i as next smaller suffix, that is

$$ \mathcal {P}_i = \left\{ j\in \left[ 0 \, ..\,i\right) : \texttt {nss}[j] = i\right\} $$

For instance, in the example we have \(\mathcal {P}_{4} = \left\{ 1,3\right\} \) because \(\texttt {nss}[1] = \texttt {nss}[3] = 4\).

Fig. 2.

A Lyndon grouping of \(\texttt {acedcebceece\$}{}\) with group contexts.

3 GSACA

We start by giving a high-level description of Baier's grouping-based sorting principle [1, 2]. In essence, the suffixes are first assigned to lexicographically ordered groups, which are then refined until the suffix array emerges. The algorithm consists of the following steps.

  • Initialisation: Group the suffixes according to their first character.

  • Phase I: Refine the groups until the elements in each group have the same Lyndon prefix.

  • Phase II: Sort elements within groups lexicographically.

Definition 3

(Suffix Grouping, adapted from [3]). Let S be a string of length n and \(\texttt {SA}\) the corresponding suffix array. A group \(\mathcal {G}\) with group context \(\alpha \) is a tuple \(\langle g_s,g_e,\left| \alpha \right| \rangle \) with group start \(g_s\in \left[ 0 \, ..\,n\right) \) and group end \(g_e\in \left[ g_s \, ..\,n\right) \) such that the following properties hold:

  1.

    All suffixes in \(\texttt {SA}\left[ g_s \, ..\,g_e\right] \) share the prefix \(\alpha \), i.e. for all \(i\in \texttt {SA}\left[ g_s \, ..\,g_e\right] \) it holds \(S_i=\alpha S_{i+\left| \alpha \right| }\).

  2.

    \(\alpha \) is a Lyndon word.

We say i is in \(\mathcal {G}\) or i is an element of \(\mathcal {G}\) and write \(i\in \mathcal {G}\) if and only if \(i\in \texttt {SA}\left[ g_s \, ..\,g_e\right] \). A suffix grouping for S is a set of groups \(\mathcal {G}_1,\dots ,\mathcal {G}_m\), where the groups are pairwise disjoint and cover the entire suffix array. Formally, if \(\mathcal {G}_i = \langle g_{s,i},g_{e,i},\left| \alpha _i\right| \rangle \) for all i, then \(g_{s,1} = 0, g_{e,m} = n-1\) and \(g_{s,j} = 1 + g_{e,j-1}\) for all \(j\in \left[ 2 \, ..\,m\right] \). For \(i,j\in \left[ 1 \, ..\,m\right] \), \(\mathcal {G}_i\) is a lower (higher) group than \(\mathcal {G}_j\) if and only if \(i<j\) (\(i>j\)). If all elements in a group \(\mathcal {G}\) have \(\alpha \) as their Lyndon prefix then \(\mathcal {G}\) is a Lyndon group. If \(\mathcal {G}\) is not a Lyndon group, it is called preliminary. Furthermore, a suffix grouping is Lyndon if all its groups are Lyndon groups, and preliminary otherwise.
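To illustrate Definition 3, the following sketch (ours; it reuses is_lyndon from the sketch in Sect. 2) checks both conditions for a candidate group against a given suffix array:

```cpp
#include <cstdint>
#include <string>
#include <vector>

bool is_lyndon(const std::string& w);  // see the sketch in Sect. 2

// Checks the two conditions of Definition 3 for a candidate group <gs, ge, |alpha|>.
bool is_valid_group(const std::string& S, const std::vector<int64_t>& SA,
                    int64_t gs, int64_t ge, int64_t alen) {
    const std::string alpha = S.substr(SA[gs], alen);        // candidate group context
    if (!is_lyndon(alpha)) return false;                      // condition 2: alpha is Lyndon
    for (int64_t k = gs; k <= ge; ++k)                        // condition 1: common prefix alpha
        if (S.compare(SA[k], alen, alpha) != 0) return false;
    return true;
}
// E.g. the group <3, 4, |ce|> of Fig. 2 satisfies both conditions; it is even a
// Lyndon group, since ce is the Lyndon prefix of both of its elements (4 and 10).
```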

With these notions, a suffix grouping is created in the initialisation, which is then refined in Phase I until it is Lyndon, and further refined in Phase II until the suffix array emerges. Figure 2 shows a Lyndon grouping of our running example.

In Subsects. 3.1 and 3.2 we explain Phases II and I, respectively, of our suffix array construction algorithm. Phase II is described first because it is much simpler.

Algorithm 1. Computing the suffix array from a Lyndon grouping.

3.1 Phase II

In Phase II we need to refine the Lyndon grouping obtained in Phase I into the suffix array. Let \(\mathcal {G}\) be a Lyndon group with context \(\alpha \) and let \(i,j\in \mathcal {G}\). Since \(S_i=\alpha S_{i+\left| \alpha \right| }\) and \(S_j=\alpha S_{j+\left| \alpha \right| }\), we have \(S_i <_{lex} S_j\) if and only if \(S_{i+\left| \alpha \right| } <_{lex} S_{j+\left| \alpha \right| }\). Hence, in order to find the lexicographically smallest suffix in \(\mathcal {G}\), it suffices to find the lexicographically smallest suffix p in \(\left\{ i+\left| \alpha \right| : i\in \mathcal {G}\right\} \). Note that removing \(p-\left| \alpha \right| \) from \(\mathcal {G}\) and inserting it into a new group immediately preceding \(\mathcal {G}\) yields a valid Lyndon grouping. We can repeat this process until each element in \(\mathcal {G}\) is in its own singleton group. As \(\mathcal {G}\) is Lyndon, we have \(S_{k+\left| \alpha \right| } <_{lex} S_k\) for each \(k\in \mathcal {G}\). Therefore, if all groups lower than \(\mathcal {G}\) are singletons, p can be determined by a simple scan over \(\mathcal {G}\) (by determining which member of \(\left\{ i+\left| \alpha \right| : i\in \mathcal {G}\right\} \) is in the lowest group). Consider for instance \(\mathcal {G}_4=\langle 3,4,\left| \texttt {ce}\right| \rangle \) from Fig. 2, which contains the suffixes 4 and 10. We consider \(4+\left| \texttt {ce}\right| = 6\) and \(10+\left| \texttt {ce}\right| =12\). Among them, 12 belongs to the lowest group, hence \(S_{10}\) is lexicographically smaller than \(S_4\). Thus, we know \(\texttt {SA}[3] = 10\) and remove 10 from \(\mathcal {G}_4\) and repeat the process with the emerging group \(\mathcal {G}_4'=\langle 4,4,\left| \texttt {ce}\right| \rangle \). As \(\mathcal {G}_4'\) only contains 4 we know \(\texttt {SA}[4] = 4\).

If the groups are refined from lower to higher as just described, each time a group \(\mathcal {G}\) is processed, all groups lower than \(\mathcal {G}\) are singletons. However, sorting groups in such a way leads to a superlinear time complexity. Bertram et al. [3] provide a fast-in-practice \(\mathcal {O}\left( n\log n\right) \) algorithm for this, broadly following the described approach.

In order to get a linear time complexity, we turn this approach on its head like Baier does [1, 2]: Instead of repeatedly finding the next smaller suffix in a group, we consider the suffixes in lexicographically increasing order and for each encountered suffix i, we move all suffixes that have i as the next smaller suffix (i.e. those in \(\mathcal {P}_i\)) to new singleton groups immediately preceding their respective old groups. Corollary 1 implies that this procedure is well-defined.

Lemma 1

For any \(j,j'\in \mathcal {P}_i\) we have \(\mathcal {L}_j \ne \mathcal {L}_{j'}\) if and only if \(j \ne j'\).

Corollary 1

In a Lyndon grouping, the elements of \(\mathcal {P}_i\) are in different groups.

Accordingly, Algorithm 1 correctly computes the suffix array from a Lyndon grouping. A formal proof of correctness is given in [1, 2]. Figure 3 shows Algorithm 1 applied to our running example.

Fig. 3.

Refining a Lyndon grouping for \(S=\texttt {acedcebceece\$}\) (see Fig. 2) into the suffix array, as done in Algorithm 1. Inserted elements are coloured green.

Note that each element \(i\in \left[ 0 \, ..\,n-1\right) \) has exactly one next smaller suffix, hence there is exactly one j with \(i\in \mathcal {P}_j\) and thus i is inserted exactly once into a new singleton group in Algorithm 1. Therefore, it suffices to map each group from the Lyndon grouping obtained from Phase I to its current start; we use an array C that contains the current group starts.
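The following C++ sketch spells out this principle (it is our illustration of the idea behind Algorithm 1, not the pseudocode itself). It assumes that, for every suffix j, gi[j] is the original start of j's Lyndon group and that C initially maps every such start to itself; the \(\mathcal {P}_i\)-sets are derived from \(\texttt {nss}\) here, whereas the actual algorithm obtains them from the \(\texttt {pss}\)-tree as described below:

```cpp
#include <cstdint>
#include <vector>

// Sketch of the Phase II principle: process the suffixes in increasing
// lexicographic order; whenever a suffix i is reached, every j with nss[j] == i
// is moved into a new singleton group occupying the current start of j's old group.
// gi[j] = original start of the Lyndon group containing j,
// C[g]  = current start of the group whose original start is g (initially C[g] == g).
std::vector<int64_t> phase2_simple(const std::vector<int64_t>& nss,
                                   const std::vector<int64_t>& gi,
                                   std::vector<int64_t> C) {
    const int64_t n = (int64_t)nss.size();
    // P[i] = { j : nss[j] == i }, cf. Definition 2
    std::vector<std::vector<int64_t>> P(n);
    for (int64_t j = 0; j + 1 < n; ++j) P[nss[j]].push_back(j);

    std::vector<int64_t> SA(n, -1);
    SA[0] = n - 1;                      // the null-suffix is the smallest suffix
    for (int64_t k = 0; k < n; ++k) {   // SA[k] is already final when it is read
        for (int64_t j : P[SA[k]]) {    // by Corollary 1, these j lie in pairwise different groups
            SA[C[gi[j]]] = j;           // new singleton group at the old group's start
            ++C[gi[j]];                 // the old group loses its first position
        }
    }
    return SA;
}
```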

There are two major differences between our Phase II and Baier’s; both concern the iteration over the \(\mathcal {P}_i\)-sets.

The first difference is the way in which we determine the elements of \(\mathcal {P}_i\) for some i. The following observations enable us to iterate over \(\mathcal {P}_i\).

Lemma 2

\(\mathcal {P}_i\) is empty if and only if \(i = 0\) or \(S_{i-1} <_{lex} S_i\). Furthermore, if \(\mathcal {P}_i\ne \emptyset \) then \(i-1\in \mathcal {P}_i\).

Lemma 3

For any \(j\in \left[ 0 \, ..\,i\right) \), we have \(j\in \mathcal {P}_i\) if and only if j’s last child is in \(\mathcal {P}_i\), or \(j=i-1\) and \(S_j >_{lex} S_i\).

Specifically, (if \(\mathcal {P}_i\) is not empty) we can iterate over \(\mathcal {P}_i\) by walking up the \(\texttt {pss}\)-tree starting from \(i-1\) and halting when we encounter a node that is not the last child of its parent. Baier [1, 2] tests whether \(i-1\) (\(\texttt {pss}[j]\)) is in \(\mathcal {P}_i\) by explicitly checking whether \(i-1\) (\(\texttt {pss}[j]\)) has already been written to A using an explicit marker for each suffix. Reading and writing those markers leads to bad cache performance because the accessed memory locations are unpredictable (for the CPU/compiler). Lemmata 2 and 3 enable us to avoid reading and writing those markers. In fact, in our implementation of Phase II, the array A is the only memory written to that is not always in the cache. Lemma 2 tells us whether we need to follow the \(\texttt {pss}\)-chain starting at \(i-1\) or not. Namely, this is the case if and only if \(S_{i-1} >_{lex} S_i\), i.e. \(i-1\) is a leaf in the \(\texttt {pss}\)-tree. This information is required when we encounter i in A during the outer for-loop in Algorithm 1, thus we mark such an entry i in A if and only if \(\mathcal {P}_i\ne \emptyset \). Implementation-wise, we use the most significant bit (MSB) of an entry to indicate whether it is marked or not. By definition, we have \(S_{i-1} >_{lex} S_i\) if and only if \(\texttt {pss}[i] + 1 < i\). Since \(\texttt {pss}[i]\) must be accessed anyway when i is inserted into A (for traversing the \(\texttt {pss}\)-chain), we can insert i marked or unmarked into A. Further, Lemma 3 implies that we must stop traversing a \(\texttt {pss}\)-chain when the current element is not the last child of its parent. We mark the entries in \(\texttt {pss}\) accordingly, also using the MSB of each entry. In the rest of this paper, we assume \(\texttt {pss}\) to be marked in this way.

Consider for instance \(i=6\) in our running example. As \(6-1=5\) is a leaf (cf. Fig. 1), we have \(5\in \mathcal {P}_6\). We can deduce the fact that 5 is indeed a leaf from \(\texttt {pss}[6]=0 < 5\) alone. Further, 5 is the last child of \(\texttt {pss}[5]=4\), so \(4\in \mathcal {P}_6\). Since 4 is not the last child of \(\texttt {pss}[4]=0\), we have \(\mathcal {P}_6 = \left\{ 4,5\right\} \).
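A minimal sketch of this traversal is given below. For readability it uses an explicit boolean array last_child instead of the MSB marking of the \(\texttt {pss}\) entries described above; the names are ours:

```cpp
#include <cstdint>
#include <vector>

// Enumerates P_i by walking up the pss-tree (Lemmas 2 and 3). last_child[j] is
// true iff j is the last child of pss[j]; it stands in for the MSB marking of
// the pss entries described above. The root is represented by pss value -1.
template <typename Visit>
void for_each_in_P(int64_t i, const std::vector<int64_t>& pss,
                   const std::vector<bool>& last_child, Visit visit) {
    if (i == 0 || pss[i] + 1 == i) return;  // Lemma 2: P_i is empty iff i == 0 or S_{i-1} < S_i
    int64_t j = i - 1;                      // otherwise i-1 belongs to P_i
    for (;;) {
        visit(j);
        if (!last_child[j]) return;         // Lemma 3: stop below a non-last child ...
        j = pss[j];                         // ... otherwise the parent is also in P_i
    }
}
// Running example with i = 6: visits 5 and then its parent 4, and stops there
// because 4 is not the last child of 0 (matching P_6 = {4, 5} above).
```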

Algorithm 2. Our Phase II.
Fig. 4.

Refining a Lyndon grouping for \(S=\texttt {acedcebceece\$}\) (see Fig. 2) into the suffix array using Algorithm 2. Marked entries are coloured blue while inserted but unmarked elements are coloured green. Note that the uncoloured entries are not actually present in the array A but only serve to indicate the current Lyndon grouping.

The second major change concerns the cache-unfriendliness of traversing the \(\mathcal {P}_i\)-sets. This bad cache performance results from the fact that the next \(\texttt {pss}\)-value (and the group start pointer) cannot be fetched until the current one is in memory. Instead of traversing the \(\mathcal {P}_i\)-sets one after another, we traverse multiple such sets simultaneously, in a breadth-first-search-like manner. Specifically, we maintain a small (\(\le 2^{10}\) elements) queue Q of elements (nodes in the \(\texttt {pss}\)-tree) that can currently be processed. Then we iterate over Q and process the entries one after another. Parents of last children are inserted into Q in the same order as the respective children. After each iteration, we continue to scan over A and for each encountered marked entry i insert \(i-1\) into Q until we either encounter an empty entry in A or Q reaches its maximum capacity. This is repeated until the suffix array emerges. The queue size could be unlimited, but limiting it ensures that it fits into the CPU’s cache. Figure 4 shows our Phase II on the running example and Algorithm 2 describes it formally in pseudocode.

Theorem 1

Algorithm 2 correctly computes the suffix array from a Lyndon grouping.

3.2 Phase I

In Phase I, a Lyndon grouping is derived from a suffix grouping in which the group contexts have length (at least) one. That is, the suffixes are sorted and grouped by their Lyndon prefixes. Lemma 4 describes the relationship between the Lyndon prefixes and the \(\texttt {pss}\)-tree that is essential to Phase I.

Lemma 4

Let \(c_1<\dots <c_k\) be the children of \(i\in \left[ 0 \, ..\,n\right) \) in the \(\texttt {pss}\)-tree. \(\mathcal {L}_i\) is S[i] concatenated with the Lyndon prefixes of \(c_1,\dots ,c_k\). More formally:

$$\begin{aligned} \mathcal {L}_i = S\left[ i \, ..\,\texttt {nss}[i]\right) = S[i]S\left[ c_1 \, ..\,c_2\right) \dots S\left[ c_k \, ..\,\texttt {nss}[i]\right) = S[i]\mathcal {L}_{c_1}\dots \mathcal {L}_{c_k} \end{aligned}$$
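In particular, the first equality yields the length of every Lyndon prefix directly from \(\texttt {nss}\), so the whole Lyndon array is available as soon as \(\texttt {nss}\) has been computed (illustrative sketch, ours):

```cpp
#include <cstdint>
#include <vector>

// Length of the Lyndon prefix of every suffix, read off from nss via Lemma 4:
// |L_i| = nss[i] - i.
std::vector<int64_t> lyndon_array(const std::vector<int64_t>& nss) {
    std::vector<int64_t> lyn(nss.size());
    for (int64_t i = 0; i < (int64_t)nss.size(); ++i) lyn[i] = nss[i] - i;
    return lyn;
}
// Running example: nss[7] == 10, hence |L_7| == 3 and L_7 == "cee" (cf. Fig. 1).
```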

We start from the initial suffix grouping in which the suffixes are grouped according to their first characters. From the relationship between the Lyndon prefixes and the \(\texttt {pss}\)-tree in Lemma 4 one can get the general idea of extending the context of a node’s group with the Lyndon prefixes of its children (in correct order) while maintaining the sorting [1]. Note that any node is by definition in a higher group than its parent. Also, by Lemma 4 the leaves of the \(\texttt {pss}\)-tree are already in Lyndon groups in the initial suffix grouping. Therefore, if we consider the groups in lexicographically decreasing order (i.e. higher to lower) and append the context of the current group to the group context of each parent of its elements (inserting the parents into new groups accordingly), each encountered group is guaranteed to be Lyndon [1]. Consequently, we obtain a Lyndon grouping. Figure 5 shows this principle applied to our running example. Formally, the suffix grouping satisfies the following property during Phase I before and after processing a group:

Property 1

For any \(i\in \left[ 0 \, ..\,n\right) \) with children \(c_1<\dots <c_k\) there is \(j\in \left[ 0 \, ..\,k\right] \) such that (a) \(c_1,\dots ,c_j\) are in groups that have already been processed, (b) \(c_{j+1},\dots ,c_k\) are in groups that have not yet been processed, and (c) the context of the group containing i is \(S[i]\mathcal {L}_{c_1}\dots \mathcal {L}_{c_j}\). Furthermore, each processed group is Lyndon.

Fig. 5.

Refining the initial suffix grouping for \(S=\texttt {acedcebceece\$}\) (see Fig. 2) into the Lyndon grouping. Elements in Lyndon groups are marked gray or green, depending on whether they have been processed already. Note that the applied procedure does not entirely correspond to our algorithm for Phase I; it only serves to illustrate the sorting principle.

Additionally, and unlike in Baier’s original approach, all groups created during our Phase I are either Lyndon or contain only elements whose Lyndon prefix is different from the group’s context.

Definition 4

(Strongly preliminary group). We call a preliminary group \(\mathcal {G}=\langle g_s,g_e,\left| \alpha \right| \rangle \) strongly preliminary if and only if \(\mathcal {G}\) contains only elements whose Lyndon prefix is not \(\alpha \). A preliminary group that is not strongly preliminary is called weakly preliminary.

Lemma 5

For any weakly preliminary group \(\mathcal {G}=\langle g_s,g_e,\left| \alpha \right| \rangle \) there is some \(g'\in \left[ g_s \, ..\,g_e\right) \) such that \(\mathcal {G}'=\langle g_s,g',\left| \alpha \right| \rangle \) is a Lyndon group and \(\mathcal {G}''=\langle g'+1,g_e,\left| \alpha \right| \rangle \) is a strongly preliminary group.

For instance, in Fig. 5 there is a group containing 1, 4 and 10 with context \(\texttt {ce}\). However, 4 and 10 have this context as Lyndon prefix while 1 has \(\texttt {ced}\). Consequently, 1 will later be moved to a new group. Hence, when Baier (and Bertram et al.) create a weakly preliminary group (in Fig. 5 this happens while processing the Lyndon group with context \(\texttt {e}\)), we instead create two groups, the lower containing 4 and 10 and the higher containing 1.

During Phase I we maintain the suffix grouping using the following data structures: two arrays A and I of length n each, where A contains the unprocessed Lyndon groups and the sizes of the strongly preliminary groups, and I maps each element \(s\in \left[ 0 \, ..\,n\right) \) to the start of the group containing s. We call I[s] the group pointer of s. Further, we store the starts of the already processed groups in a list C. Let \(\mathcal {G}=\langle g_s,g_e,\left| \alpha \right| \rangle \) be a group. For each \(s\in \mathcal {G}\) we have \(I[s] = g_s\). If \(\mathcal {G}\) is Lyndon and has not yet been processed, we also have \(s\in A\left[ g_s \, ..\,g_e\right] \) for all \(s\in \mathcal {G}\) and \(A[g_s]<A[g_s+1]< \dots < A[g_e]\). If \(\mathcal {G}\) is Lyndon and has been processed already, there is some j such that \(C[j] = g_s\). If \(\mathcal {G}\) is (strongly) preliminary we have \(A[g_s] = g_e + 1 - g_s\) and \(A[k] = 0\) for all \(k\in \left( g_s \, ..\,g_e\right] \).
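Expressed as plain data, this state can be summarised as follows (an illustration of the representation just described; the field names follow the text):

```cpp
#include <cstdint>
#include <vector>

// Illustration of the Phase I state described above.
struct Phase1State {
    // Unprocessed Lyndon groups, each stored sorted; for a strongly preliminary
    // group starting at g_s: A[g_s] = group size, the remaining entries are 0.
    std::vector<uint64_t> A;
    // I[s] = start of the group currently containing suffix s (group pointer of s).
    std::vector<uint64_t> I;
    // Starts of the already processed (Lyndon) groups.
    std::vector<uint64_t> C;
};
```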

There are several reasons why our Phase I is much faster than Baier’s. Firstly, we do not write preliminary groups to A. Secondly, we compute \(\texttt {pss}\) beforehand using an algorithm by Bille et al. [4] instead of on the fly as Baier does [1, 2]. Furthermore, we have the Lyndon groups in A sorted and store the sizes of the strongly preliminary groups in A as well. The former makes finding the number of children a parent has in the currently processed group easier and faster. The latter makes the separate array of length n used by Baier [1, 2] for the group sizes obsolete and is made possible by the fact that we only write Lyndon groups to A. For reasons why these changes lead to a faster algorithm see [13].

As alluded above, we follow Baier’s general approach and consider the Lyndon groups in lexicographically decreasing order while updating the groups containing the parents of elements in the current group. Since children are in higher groups than their parents by definition, when we encounter some group \(\mathcal {G}=\langle g_s,g_e,\left| \alpha \right| \rangle \), the children of any element in \(\mathcal {G}\) are in already processed groups. Hence, by Property 1 \(\mathcal {G}\) must be Lyndon. For a formal proof see [1].

In the rest of this section we explain how to actually process a Lyndon group.

Let \(\mathcal {G}= \langle g_s,g_e,\left| \alpha \right| \rangle \) be the currently processed group and w.l.o.g. assume that no element in \(\mathcal {G}\) has the root \(-1\) as parent (we do not have the root in the suffix grouping, thus nodes with the root as parent can be ignored here). Furthermore, let \(\mathcal {A}\) be the set of parents of elements in \(\mathcal {G}\) (i.e. \(\mathcal {A}= \left\{ \texttt {pss}[i] : i\in \mathcal {G}, \texttt {pss}[i] \ge 0\right\} \)) and let \(\mathcal {G}_1<\dots <\mathcal {G}_k\) be those (necessarily preliminary) groups containing elements from \(\mathcal {A}\). For each \(g\in \left[ 1 \, ..\,k\right] \) let \(\alpha _g\) be the context of \(\mathcal {G}_g\).

As illustrated by Fig. 5, we have to consider the number of children an element in \(\mathcal {A}\) has in \(\mathcal {G}\). Specifically, we need to move two parents in \(\mathcal {A}\) which are currently in the same group to different new groups if they have differing numbers of children in \(\mathcal {G}\). Let \(\mathcal {A}_\ell \) contain those elements from \(\mathcal {A}\) with exactly \(\ell \) children in \(\mathcal {G}\). Maintaining Property 1 requires that, after processing \(\mathcal {G}\), for each \(g\in \left[ 1 \, ..\,k\right] \) and each \(\ell \), the elements in \(\mathcal {G}_g\cap \mathcal {A}_\ell \) are in groups with context \(\alpha _g\alpha ^\ell \). For any \(\ell <\ell '\), we have \(\alpha _g\alpha ^\ell <_{lex} \alpha _g\alpha ^{\ell '}\), thus the elements in \(\mathcal {G}_g\cap \mathcal {A}_\ell \) must form a lower group than those in \(\mathcal {G}_g\cap \mathcal {A}_{\ell '}\) after \(\mathcal {G}\) has been processed [1, 2]. To achieve this, first the parents in \(\mathcal {A}_{\left| \mathcal {G}\right| }\) are moved to new groups immediately following their respective old groups, then those in \(\mathcal {A}_{\left| \mathcal {G}\right| -1}\) and so on [1, 2].

We proceed as follows. First, determine \(\mathcal {A}\) and count how many children each parent has in \(\mathcal {G}\). Then, sort the parents according to these counts using a bucket sort. Further, partition the elements in each bucket into two sub-buckets depending on whether they should be inserted into Lyndon groups or strongly preliminary groups. Then, for the sub-buckets (in the order of decreasing count; for equal counts: first strongly preliminary then Lyndon sub-buckets) move the parents into new groups. Because of space constraints, we do not describe the rather technical details. These can be found in the extended paper [13].
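The first two of these steps can be sketched as follows (our illustration only; it uses a hash map for brevity, and the subsequent movement of the parents into new groups is omitted):

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Determine the parents of the elements of the current Lyndon group G and
// bucket them by their number of children in G (the sets A_l from above).
// Children of the root (pss value -1) are ignored.
std::vector<std::vector<int64_t>>
parents_by_child_count(const std::vector<int64_t>& G,   // elements of the current group
                       const std::vector<int64_t>& pss) {
    std::unordered_map<int64_t, int64_t> children_in_G;
    for (int64_t i : G)
        if (pss[i] >= 0) ++children_in_G[pss[i]];
    std::vector<std::vector<int64_t>> bucket(G.size() + 1); // bucket[l] corresponds to A_l
    for (const auto& [parent, count] : children_in_G)
        bucket[count].push_back(parent);
    return bucket;  // to be handled from bucket[|G|] down to bucket[1]
}
```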

4 Experiments

Our implementation FGSACA of the optimised GSACA is publicly available.

We compare our algorithm with the GSACA implementation by Baier [1, 2], and the double sort algorithms DS1 and DSH by Bertram et al. [3]. The latter two also use the grouping principle but employ integer sorting and have super-linear time complexity. DSH differs from DS1 only in the initialisation: in DS1 the suffixes are sorted by their first character while in DSH up to 8 characters are considered. We further include DivSufSort 2.0.2 and libsais 2.7.1 since the former is used by Bertram et al. as a reference [3] and the latter is the currently fastest suffix sorter known to us.

Fig. 6.

Normalised running time and working memory averaged for each category. The horizontal red line indicates the time for libsais. For Large we did not test GSACA because Baier’s reference implementation only supports 32-bit words.

The algorithms were evaluated on real texts (in the following PC-Real), real repetitive texts (PC-Rep-Real) and artificial repetitive texts (PC-Rep-Art) from the Pizza & Chili corpus. To test the algorithms on texts for which a 32-bit suffix array is not sufficient, we also included larger texts (Large), namely the first \(10^{10}\) bytes of the English Wikipedia dump from 01.06.2022 and the human DNA concatenated with itself. For more detail on the data and our testing methodology see the longer version of this paper [13].

All algorithms were faster on the more repetitive datasets, on which the differences between the algorithms were also smaller. On all datasets, our algorithm is between \(46\%\) and \(60\%\) faster than GSACA; compared to DSH, it is about \(2\%\) faster on repetitive data, over \(11\%\) faster on PC-Real and over \(13\%\) faster on Large.

Especially notable is the difference in the time required for Phase II: our Phase II is between \(33\%\) and \(50\%\) faster than Phase II of DSH. Our Phase I is also faster than Phase I of DS1 by a similar margin. Conversely, Phase I of DSH is much faster than our Phase I; however, this is only due to its more elaborate construction of the initial suffix grouping, as demonstrated by the much slower Phase I of DS1. Compared to FGSACA, libsais is between \(3\%\) and \(46\%\) faster.

Memory-wise, for 32-bit words, FGSACA uses about 8.83 bytes per input character, while DS1 and DSH use 8.94 and 8.05 bytes per character, respectively. GSACA always uses 12 bytes per character. On Large, FGSACA requires about twice as much memory, as expected. For DS1 and DSH this is not the case, mostly because they use 40-bit integers for the additional array of length n that they require (while we use 64-bit integers). DivSufSort requires only a small constant amount of working memory, and libsais never exceeded 21 KiB of working memory on our test data.