1 Introduction

A non-empty string w is said to be an absent word (a.k.a. a forbidden word) for a string S if w is not a substring of S. An absent word w for S is said to be a minimal absent word (MAW) for S if all proper substrings of w occur in S. For instance, for string \(S = \texttt{bbacccbaa}\) over an alphabet \(\varSigma = \{\texttt{a}, \texttt{b}, \texttt{c}, \texttt{d}\}\), the set \(\textsf{MAW}(S)\) of all MAWs for S is \(\{\texttt{aaa},\texttt{bbb},\texttt{cccc},\texttt{d}, \texttt{ab},\texttt{ca},\texttt{bc}, \texttt{aac},\texttt{acb},\texttt{cbb},\texttt{accb},\) \(\texttt{cbac}, \texttt{bbaa}\}\). MAWs are combinatorial string objects whose mathematical properties have been studied extensively in the literature (see [1, 7, 16, 17, 19, 23] and references therein). MAWs also enjoy several applications, including phylogeny [11], data compression [3, 15, 18], musical information retrieval [14], and bioinformatics [2, 12, 22, 24].
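
To make the definition concrete, the following is a minimal brute-force sketch (ours, not from the cited literature) that enumerates MAWs directly from the definition; it is exponential in general and intended only for reproducing small examples such as the one above.

```python
from itertools import product

def maws(S, sigma):
    """Brute-force MAW(S) over the alphabet sigma; for tiny examples only."""
    subs = {S[i:j] for i in range(len(S) + 1) for j in range(i, len(S) + 1)}
    out = {c for c in sigma if c not in subs}      # length-1 MAWs: absent characters
    for u in subs:                                 # u ranges over Substr(S), incl. the empty string
        for a, b in product(sigma, repeat=2):
            w = a + u + b
            # w is a MAW iff w is absent while w[:-1] = au and w[1:] = ub both occur in S
            if w not in subs and a + u in subs and u + b in subs:
                out.add(w)
    return out

# maws("bbacccbaa", "abcd") reproduces the 13 MAWs listed above.
```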

It is known that the number \(|\textsf{MAW}(S)|\) of MAWs for a string S of length n over an alphabet of size \(\sigma \) is \(O(\sigma n)\) and that this bound is tight [17]. Crochemore et al. [17] gave an algorithm that computes \(\textsf{MAW}(S)\) in \(O(\sigma n)\) time with O(n) working space. Fujishige et al. [20] gave an improved algorithm that computes \(\textsf{MAW}(S)\) in optimal \(O(n + |\textsf{MAW}(S)|)\) time with O(n) working space, for an input string S of length n over an integer alphabet of polynomial size in n. Both aforementioned algorithms utilize an O(n)-size string data structure called the directed acyclic word graph (DAWG) [9], which recognizes the set of substrings of S and can be built in \(O(n \log \sigma )\) time for general ordered alphabets [9], and in O(n) time for integer alphabets of polynomial size in n [20]. There also exist other efficient algorithms for computing MAWs based on other string data structures such as suffix arrays and the Burrows–Wheeler transform [4, 8].

The aim of this paper is to extend the notion of MAWs to a set \(\mathcal {S} = \{S_1, \ldots , S_k\}\) of multiple k strings. We are aware of a few related attempts in earlier work: Chairungsee and Crochemore [11] introduced a string similarity measure based on the symmetric difference \(\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)\) of the sets of MAWs for two strings \(S_1\) and \(S_2\) to compare. They introduced a length threshold \(\ell \ge 1\) and described an approach for computing \(\left( \textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2) \right) \cap \varSigma ^\ell \) in the following steps: First, tries of size \(O(n \ell )\) each, representing the substrings of \(S_1\) and \(S_2\) of length up to \(\ell \), are built, where \(n = |S_1| + |S_2|\). Then, two tries representing \(\textsf{MAW}(S_1) \cap \varSigma ^\ell \) and \(\textsf{MAW}(S_2) \cap \varSigma ^\ell \), respectively, are built, which require \(O(n \sigma )\) space. Finally, the length-bounded symmetric difference \(\left( \textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2) \right) \cap \varSigma ^\ell \) is computed from \(\textsf{MAW}(S_1) \cap \varSigma ^\ell \) and \(\textsf{MAW}(S_2) \cap \varSigma ^\ell \), but the authors did not explicitly describe how this computation is done in their method. Overall, their algorithm requires \(\Omega (n (\ell + \sigma ))\) time and space [11]. Charalampopoulos et al. [12] tackled the same problem of computing the symmetric difference \(\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)\) (without the length threshold \(\ell \)) and proposed a solution that requires \(O(\sigma n)\) time and space. Their method first computes \(\textsf{MAW}(S_1)\) and \(\textsf{MAW}(S_2)\) separately, and then removes the elements that are in \(\textsf{MAW}(S_1) \cap \textsf{MAW}(S_2)\). Charalampopoulos, Crochemore, and Pissis [13] showed how to count the number \(|\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)|\) of elements in the symmetric difference \(\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)\) in O(n) time in the case of integer alphabets of polynomial size in n, without listing the elements explicitly.

Let \(\mathcal {S} = \{S_1, \ldots , S_k\}\) be the input set of k strings, and let \(\textbf{B}\in \{0,1\}^k\) be a given bit vector of length k. Our problem is to list the (generalized) MAWs w for \(\mathcal {S}\) and \(\textbf{B}\) such that \(w \in \textsf{MAW}(S_i)\) for every i with \(\textbf{B}[i] = 1\), and \(w \notin \textsf{MAW}(S_i)\) for every i with \(\textbf{B}[i] = 0\). For \(k = 2\), the aforementioned problem of computing \(\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)\) is equivalent to solving our problem for \(\textbf{B}= 01\) and \(\textbf{B}= 10\). Let n be the total length of the k input strings in \(\mathcal {S}\). In Sect. 4 and Sect. 5, we deal with the case of \(k = 2\) and present an algorithm running in \(O(n+ |\textsf{M}_{\textbf{B}}|)\) time with O(n) working space, where \(\textsf{M}_\textbf{B}\) denotes the set of (generalized) MAWs to output for a given bit vector \(\textbf{B}\) (Theorem 2). This immediately gives us an algorithm for listing the elements of the symmetric difference \(\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)\) in optimal \(O(n + |\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)|)\) time (Corollary 1). In Sect. 6, we extend our solution to the general case of \(k > 2\): it works in \(O(n \lceil k / \log n \rceil + |\textsf{M}_{\textbf{B}}|)\) time with \(O(n(k+\log n))\) bits of working space on the word RAM model with machine word size \(\omega = \log n\). Thus, for \(k = O(\log n)\), our algorithm runs in optimal \(O(n + |\textsf{M}_{\textbf{B}}|)\) time. All the bounds claimed in this paper are valid for linearly sortable alphabets, including integer alphabets of polynomial size in n.

As in the previous work [17, 20, 21], our key data structure is the DAWG for the input set \(\mathcal {S}\) of strings. The best-known algorithm for constructing the DAWG for a set of strings of total length n takes \(O(n \log \sigma )\) time [10], and can thus require \(O(n \log n)\) time for large alphabets. We describe how the DAWG for a given set \(\mathcal {S}\) of strings over an integer alphabet of polynomial size in n can be built in optimal O(n) time (Theorem 1), which may be of independent interest.

2 Preliminaries

Strings. Let \(\varSigma \) be an ordered alphabet. An element of \(\varSigma \) is called a character. For characters \(a,b \in \varSigma \), we write \(a \prec b\) (or equivalently \(b \succ a\)) if a is lexicographically smaller than b. An element of \(\varSigma ^*\) is called a string. The length of a string S is denoted by |S|. The empty string \(\varepsilon \) is the string of length 0. If \(S = xyz\), then x, y, and z are called a prefix, substring, and suffix of S, respectively. They are called a proper prefix, proper substring, and proper suffix of S if \(x \ne S\), \(y \ne S\), and \(z \ne S\), respectively. Let \(\textsf{Substr}(S)\) denote the set of substrings of string S. For any \(1 \le i \le |S|\), the i-th character of S is denoted by S[i]. For any \(1 \le i \le j \le |S|\), S[i..j] denotes the substring of S starting at i and ending at j. For convenience, let \(S[i..j] = \varepsilon \) for \(0 \le j < i \le |S|+1\). We say that a string w occurs in a string S iff w is a substring of S. Note that by definition the empty string \(\varepsilon \) is a substring of any string S and hence \(\varepsilon \) always occurs in S.

For a set \(\mathcal {S}\) of strings, let \(\Vert \mathcal {S} \Vert \) denote the total length of the strings in \(\mathcal {S}\), that is, \(\Vert \mathcal {S} \Vert = \sum _{S \in \mathcal {S}}|S|\). Let \(\textsf{Substr}(\mathcal {S})\) denote the set of substrings of the strings in \(\mathcal {S}\), that is, \(\textsf{Substr}(\mathcal {S}) = \left( \bigcup _{S \in \mathcal {S}}\{S[i..j] \mid 1 \le i \le j \le |S|\} \right) \cup \{\varepsilon \}\).

Minimal Absent Words (MAWs). A string w is called an absent word for a string S if w does not occur in S. Let \(\textsf{AW}(S) = \varSigma ^* \setminus \textsf{Substr}(S)\) denote the set of absent words for a string S. An absent word \(w \in \textsf{AW}(S)\) for string S is called a minimal absent word or MAW for S if any proper substring of w occurs in S. We denote by \(\textsf{MAW}(S)\) the set of all MAWs for S. Let \(\textsf{nonMAW}(S) = \textsf{AW}(S) \setminus \textsf{MAW}(S)\) be the set of absent words for S which are not MAWs. Note that, for strings w and S, it holds that \(w \notin \textsf{MAW}(S)\) iff \(w \in \textsf{Substr}(S) \cup \textsf{nonMAW}(S)\).

We extend the aforementioned notion of MAWs to a set \(\mathcal {S} = \{S_1, \ldots , S_k\}\) of k strings for \(k \ge 1\), as follows: Let \(\textbf{B}\) be a bit vector of length k, and let \(\mathcal {S}_{\textbf{B}}\) be the subset of \(\mathcal {S}\) such that \(\mathcal {S}_{\textbf{B}} = \{S_i \mid \textbf{B}[i] = 1\}\). Let \(\overline{\mathcal {S}_{\textbf{B}}} = \{S_i \mid \textbf{B}[i] = 0\} = \mathcal {S} \setminus \mathcal {S}_{\textbf{B}}\). A string w is said to be a MAW for \(\mathcal {S}_{\textbf{B}}\) if (1) \(w \in \bigcap _{S_i \in \mathcal {S}_{\textbf{B}}}\textsf{MAW}(S_i)\) and (2) \(w \notin \bigcup _{S_i \in \overline{\mathcal {S}_{\textbf{B}}}}\textsf{MAW}(S_i)\). Condition (1) implies that w is a MAW for every string in \(\mathcal {S}_{\textbf{B}}\). Condition (2) implies that w is not a MAW for any string in \(\overline{\mathcal {S}_{\textbf{B}}}\), which is equivalent to saying that \(w \in \bigcap _{S_i \in \overline{\mathcal {S}_{\textbf{B}}}}(\textsf{Substr}(S_i) \cup \textsf{nonMAW}(S_i))\). Let \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) be the set of all MAWs for \(\mathcal {S}_{\textbf{B}}\). Here is an example: For the string set \(\mathcal {S} = \{\texttt{abaab}, \texttt{aacbba}\}\) over the alphabet \(\varSigma = \{\texttt{a, b, c, d}\}\), \(\textsf{MAW}(\mathcal {S}_{10}) = \{\texttt{aaba}, \texttt{bab}, \texttt{bb}, \texttt{c}\}\), \(\textsf{MAW}(\mathcal {S}_{01}) = \{\texttt{ab}, \texttt{baa}, \texttt{bac}, \texttt{bbb}, \texttt{bc}, \texttt{ca}, \texttt{cba}, \texttt{cc}\}\), and \(\textsf{MAW}(\mathcal {S}_{11}) = \{\texttt{aaa, d}\}\).
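
A brute-force check of this definition on the example above (reusing the maws() helper from the sketch in Sect. 1; the function below and its interface are ours):

```python
def maw_B(strings, B, sigma):
    """MAW(S_B) by conditions (1) and (2): a MAW of S_i for every B[i] = 1,
    and not a MAW of S_i for any B[i] = 0.  B is a bit string such as "10"."""
    assert "1" in B                      # the all-zero bit vector is not considered
    sets = [maws(S, sigma) for S in strings]
    in_all_ones = set.intersection(*(sets[i] for i, b in enumerate(B) if b == "1"))
    in_some_zero = set().union(*(sets[i] for i, b in enumerate(B) if b == "0"))
    return in_all_ones - in_some_zero

# For the example above:
#   maw_B(["abaab", "aacbba"], "11", "abcd")  ->  {"aaa", "d"}
#   maw_B(["abaab", "aacbba"], "10", "abcd")  ->  {"aaba", "bab", "bb", "c"}
```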

The problem we consider in this paper is the following:

Problem 1 (MAWs for multiple input strings)

Given a set \(\mathcal {S} = \{S_1, \dots , S_k\}\) of k strings over an alphabet \(\varSigma \) and a bit vector \(\textbf{B}\) of length k, compute \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\).

3 The DAWG Data Structure

We use the directed acyclic word graph (DAWG) [9] data structure for a set \(\mathcal {S} = \{S_1, \ldots , S_k\}\) of k strings, which is a DFA of size \(O(\Vert \mathcal {S} \Vert )\) that recognizes all suffixes of the strings in \(\mathcal {S}\).

To give a formal definition of \(\textsf{DAWG}(\mathcal {S})\), let \(\mathsf {End\_Pos}_{\mathcal {S}}(w)\) denote the set of ending positions of all occurrences of a string w in the strings of \(\mathcal {S}\), that is,

$$\mathsf {End\_Pos}_{\mathcal {S}}(w) = \{(i, j) \mid S_i[j-|w|+1..j] = w, 1 \le i \le k, 1 \le j \le |S_i|\}. $$

We consider an equivalence relation \(\equiv _{\mathcal {S}}\) of strings over \(\varSigma \) w.r.t. \(\mathcal {S}\) such that, for any two strings w and u, \(w \equiv _{\mathcal {S}} u\) iff \(\mathsf {End\_Pos}_{\mathcal {S}}(w) = \mathsf {End\_Pos}_{\mathcal {S}}(u)\). For any string \(x \in \varSigma ^*\), let \([x]_{\mathcal {S}}\) denote the equivalence class for x w.r.t. \(\equiv _{\mathcal {S}}\). All the non-substrings \(x \notin \textsf{Substr}(\mathcal {S})\) form a unique equivalence class, called the degenerate class.

Definition 1

The DAWG of a set \(\mathcal {S}\) of strings, denoted \(\textsf{DAWG}(\mathcal {S})\), is an edge-labeled DAG \((V, E)\) such that

$$\begin{aligned} V &= \{[{x}]_{\mathcal {S}} \mid x \in \textsf{Substr}(\mathcal {S})\}, \\ E &= \{([{x}]_{\mathcal {S}}, b, [{xb}]_{\mathcal {S}}) \mid x, xb \in \textsf{Substr}(\mathcal {S}), b \in \varSigma \}. \end{aligned}$$

We also define the set L of suffix links of \(\textsf{DAWG}(\mathcal {S})\) by

$$ L = \{([{ax}]_{\mathcal {S}}, a, [{x}]_{\mathcal {S}}) \mid x, ax \in \textsf{Substr}(\mathcal {S}), a \in \varSigma , [{ax}]_{\mathcal {S}} \ne [{x}]_{\mathcal {S}} \}. $$

Namely, two substrings x and y in \(\textsf{Substr}(\mathcal {S})\) are represented by the same node of \(\textsf{DAWG}(\mathcal {S})\) iff the ending positions of x and y in the strings of \(\mathcal {S}\) are equal. Note that \(\textsf{DAWG}(\mathcal {S})\) does not contain the node for the degenerate class nor its in-coming edges. This is important for \(\textsf{DAWG}(\mathcal {S})\) to have a total linear number of edges [9], and for our linear-time algorithm for listing all the MAWs for a given query.
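
The following brute-force sketch (ours) groups the substrings of a small string set by their \(\mathsf {End\_Pos}\) sets, which yields exactly the nodes of Definition 1; edges and suffix links are omitted for brevity.

```python
from collections import defaultdict

def dawg_node_classes(strings):
    """Group Substr(S) by End_Pos; each group is one node of DAWG(S)."""
    endpos = defaultdict(set)
    for i, S in enumerate(strings, start=1):
        for j in range(1, len(S) + 1):        # 1-indexed ending position j
            for start in range(j + 1):        # start = j yields the empty string
                endpos[S[start:j]].add((i, j))
    classes = defaultdict(set)
    for x, ep in endpos.items():
        classes[frozenset(ep)].add(x)         # one equivalence class = one DAWG node
    return list(classes.values())

# e.g. dawg_node_classes(["abc#", "bbac$", "abca%"]) for the set of Fig. 1,
# with '#', '$', '%' standing in for the end-markers #1, #2, #3.
```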

For convenience, assume that each string \(S_i\) in \(\mathcal {S} = \{S_1, \ldots , S_k\}\) terminates with a unique end-marker \(\#_i\) which does not occur elsewhere, where \(\#_i \ne \#_j\) for \(i \ne j\). Then \(\textsf{DAWG}(\mathcal {S})\) has exactly k sink nodes; for each \(1 \le i \le k\), the i-th sink recognizes all the non-empty suffixes of \(S_i\) and is labeled by i.

The DAWG for a single string T is the DAWG for a singleton \(\{T\}\) and is denoted by \(\textsf{DAWG}(T)\).

The state-of-the-art algorithm that builds \(\textsf{DAWG}(\mathcal {S})\) is Blumer et al.’s online algorithm [9] which runs in \(O(n \log \sigma )\) time with O(n) space, where \(n = \Vert \mathcal {S} \Vert \) is the total length of the strings in \(\mathcal {S}\) and \(\sigma \) is the alphabet size. Below we describe a faster construction of \(\textsf{DAWG}(\mathcal {S})\) in the case of integer alphabets:

Theorem 1 (Linear-time DAWG construction for a set of strings)

For a given set \(\mathcal {S} = \{S_1, \ldots , S_k\}\) of k strings of total length n over an integer alphabet \(\varSigma \) of polynomial size in n, one can build the edge-sorted \(\textsf{DAWG}(\mathcal {S})\) in O(n) time and space.

Proof

We first create a concatenated string \(T = S_1 \cdots S_k\) of total length n from the strings in \(\mathcal {S}\). We build \(\textsf{DAWG}(T)\) for the single string T in O(n) time and space, using the algorithm of Fujishige et al. [20, 21], where the out-going edges of every node are lexicographically sorted. Our goal is to convert \(G_T = \textsf{DAWG}(T)\) into \(G_{\mathcal {S}} = \textsf{DAWG}(\mathcal {S})\). For a set P of integer pairs and a pair (a, b) of integers, let \(P \oplus (a,b) = \{(p+a, q+b) \mid (p,q) \in P\}\). Our key observation is that, for any substring \(w \in \textsf{Substr}(\mathcal {S})\) that does not contain a separator \(\#_i\) except possibly at its last position, it holds that

$$\mathsf {End\_Pos}_{\mathcal {S}}(w) = \mathsf {End\_Pos}_{S_1}(w) \cup \left( \bigcup _{2 \le i \le k} \mathsf {End\_Pos}_{S_i}(w) \oplus (i-1, |S_1 \cdots S_{i-1}|) \right) . \qquad (1)$$

Equation (1) implies that the substrings w of \(T = S_1 \cdots S_{k}\) which are also substrings of \(\mathcal {S}\) are represented by essentially the same nodes in \(G_T\) and in \(G_{\mathcal {S}}\), meaning that there is an injection from the nodes of \(G_{\mathcal {S}}\) to the nodes of \(G_T\).

What is left is how to remove the redundant nodes of \(G_{T}\) which represent the substrings y of T containing a separator \(\#_i\) inside, and which are thus not substrings of \(\mathcal {S}\). Let us call the longest path of \(G_{T}\), which spells out T, the spine. Since each \(\#_i\) occurs exactly once in T, any substring of T that contains \(\#_i\) is represented by a node on the spine of \(G_T\). Thus, we can obtain \(G_{\mathcal {S}}\) by removing the redundant nodes from the spine of \(G_T\), while ensuring that for every i the suffixes of \(S_i\) ending with \(\#_i\) are still represented in the graph. This can be achieved as follows: We process \(i = k, \ldots , 2\) in decreasing order. For \(i = k\), we split the spine into two parts, spelling out \(S_1 \cdots S_{k-1}\) and \(S_k\), respectively. We then remove the nodes in the \(S_k\) part which are not reachable from the source of the modified graph, together with their out-going edges and suffix links. This gives us \(\textsf{DAWG}(\{S_1 \cdots S_{k-1}, S_{k}\})\). After processing \(i = k\), we continue the same process for \(i = k-1\) with the remaining spine that spells out \(S_1 \cdots S_{k-1}\). After processing \(i = 2\), we obtain \(G_{\mathcal {S}} = \textsf{DAWG}(\mathcal {S})\). See Fig. 1 for an example of our construction. Clearly, all the redundant nodes can be removed in O(n) total time.    \(\square \)

We remark that the order of concatenating the strings in \(\mathcal {S}\) affects neither the correctness nor the complexity of our algorithm.
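
The following Python sketch illustrates the construction of Theorem 1. It substitutes the standard online suffix-automaton construction (structurally the same object as the DAWG) for the linear-time algorithm of [20, 21], and it cuts the spine edge at every boundary between consecutive strings at once before pruning unreachable nodes, which, under the unique end-marker assumption, should have the same effect as the step-by-step removal in the proof. All names are ours; suffix links and sink labels of the resulting graph are not maintained.

```python
class State:
    __slots__ = ("len", "link", "next")
    def __init__(self, length):
        self.len, self.link, self.next = length, -1, {}

def build_dawg(T):
    """Online DAWG construction for a single string T (the classic
    suffix-automaton construction); O(|T| log sigma) with dict edges."""
    st, last = [State(0)], 0
    for c in T:
        cur = len(st); st.append(State(st[last].len + 1))
        p = last
        while p != -1 and c not in st[p].next:
            st[p].next[c] = cur
            p = st[p].link
        if p == -1:
            st[cur].link = 0
        else:
            q = st[p].next[c]
            if st[p].len + 1 == st[q].len:
                st[cur].link = q
            else:                                  # clone q to keep lengths consistent
                clone = len(st); st.append(State(st[p].len + 1))
                st[clone].next = dict(st[q].next)
                st[clone].link = st[q].link
                while p != -1 and st[p].next.get(c) == q:
                    st[p].next[c] = clone
                    p = st[p].link
                st[q].link = st[cur].link = clone
        last = cur
    return st

def dawg_of_set(strings):
    """Theorem 1 sketch: DAWG of the concatenation, then spine pruning.
    Each string is assumed to end with its own unique end-marker."""
    T = "".join(strings)
    st = build_dawg(T)
    # walk the spine (the path spelling out T) and record the node after each prefix
    spine, v = [0], 0
    for c in T:
        v = st[v].next[c]
        spine.append(v)
    # cut the out-going spine edge at every boundary prefix S_1 ... S_{i-1}
    pos = 0
    for S in strings[:-1]:
        pos += len(S)
        del st[spine[pos]].next[T[pos]]    # edge labeled with the first character of S_i
    # the nodes of DAWG(S) are the states that remain reachable from the source
    reachable, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for w in st[u].next.values():
            if w not in reachable:
                reachable.add(w); stack.append(w)
    return st, reachable

# Example of Fig. 1 (with '#', '$', '%' standing in for the end-markers):
# states, alive = dawg_of_set(["abc#", "bbac$", "abca%"])
```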

Fig. 1.

Illustration for our linear-time construction of \(\textsf{DAWG}(\mathcal {S})\) for a set \(\mathcal {S} = \{abc\#_1, bbac\#_2, abca\#_3\}\) of strings. We first build \(\textsf{DAWG}(T)\) for the concatenated string \(T = abc\#_1bbac\#_2abca\#_3\). Then, we remove the redundant nodes in the spine of the DAWG for \(i = 3\) and then for \(i = 2\). This gives us \(\textsf{DAWG}(\mathcal {S})\).

4 Algorithm Overview for \(k = 2\)

In what follows, we consider the case where our input set \(\mathcal {S}\) consists of two strings \(S_1\) and \(S_2\) which respectively terminate with special characters \(\#_1\) and \(\#_2\). We show how, given a bit vector \(\textbf{B}\in \{00, 01, 10, 11\}\) of length 2, we can compute \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) in \(O(n + |\textsf{MAW}(\mathcal {S}_{\textbf{B}})|)\) time and O(n) working space, where \(n = \Vert \mathcal {S} \Vert \).

We first build the edge-sorted \(\textsf{DAWG}(\mathcal {S})\) for a given \(\mathcal {S} = \{S_1, S_2\}\) in O(n) time and space with Theorem 1. We label each node v of \(\textsf{DAWG}(\mathcal {S})\) by \(\#_i\) iff v represents a substring of \(S_i\) (\(1 \le i \le 2\)). Let \(\textsf{label}(v) \in \{\#_1, \#_2, \#_1\#_2\}\) denote the label of node v. The labels of all nodes can be precomputed in O(n) time.

Our algorithm is based on Fujishige et al.’s algorithm [20, 21] for computing all the MAWs in the case of a single input string. Following their approach, for each node x of \(\textsf{DAWG}(\mathcal {S})\) we focus on the shortest string represented by x and denote it by au, where \(a \in \varSigma \) and \(u \in \varSigma ^*\). We use the suffix link of the node x and its target node y, whose longest member is u (namely, following the suffix link from x to y removes the first character a of au). For ease of explanation, we identify the node x with the string au, and the node y with the string u.

Fujishige et al.’s algorithm compares the out-going edges of au and those of u one by one in sorted order. Suppose u has an out-going edge labeled b. If au does not have an out-going edge labeled b, then their algorithm outputs aub as a MAW for the input string. Otherwise, it outputs nothing, and the cost of the comparison is charged to the out-going edge of au labeled b. Each MAW aub in the output is encoded by a tuple (a, i, j) such that \(S[i..j] = ub\), thus taking O(1) space. This is how Fujishige et al.’s algorithm runs in \(O(n + |\textsf{MAW}(S)|)\) time with O(n) working space for a single string S.
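
The sketch below restates this procedure for a single string S, reusing build_dawg() from the sketch in Sect. 3. It stores one sample end position per node to recover the shortest string au, and uses dictionary lookups rather than the sorted edge-list scan, so it is only a didactic approximation of the \(O(n + |\textsf{MAW}(S)|)\)-time implementation of [20, 21]; all names are ours.

```python
def maws_from_dawg(S, sigma):
    """Enumerate MAW(S) from DAWG(S) via suffix links, as described above."""
    st = build_dawg(S)
    # one sample end position per state: walk the prefixes of S and push each
    # position up the suffix-link chain until an already-assigned state is met
    pos, v = [None] * len(st), 0
    for i, c in enumerate(S, start=1):
        v = st[v].next[c]
        u = v
        while u != 0 and pos[u] is None:
            pos[u] = i
            u = st[u].link
    out = [c for c in sigma if c not in st[0].next]       # length-1 MAWs
    for x in range(1, len(st)):                           # node x represents au
        y = st[x].link                                    # node y represents u
        minlen = st[y].len + 1
        au = S[pos[x] - minlen: pos[x]]                   # shortest string of node x
        for b in st[y].next:                              # ub occurs in S
            if b not in st[x].next:                       # aub does not occur in S
                out.append(au + b)                        # hence aub is a MAW
    return out

# maws_from_dawg("bbacccbaa", "abcd") should list the 13 MAWs of the example in Sect. 1.
```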

However, in our case of multiple strings, depending on the labels of the nodes au, aub, and ub, and depending on the value of the given bit vector \(\textbf{B}\), there may exist edge comparisons that can be charged neither to the output MAWs nor to the out-going edges of node au. It is also possible that aub is a MAW for some string(s) in \(\mathcal {S}\) even though a node representing aub exists in \(\textsf{DAWG}(\mathcal {S})\). To overcome these difficulties, we introduce skip links that allow us to avoid unwanted edge-label comparisons.

5 Skip Links for \(k = 2\)

We use the same conventions for the nodes au, aub, and u on \(\textsf{DAWG}(\mathcal {S})\) as in the previous section, and also consider the node ub. There are three possible cases for the label of node au: \(\textsf{label}(au) = \#_1\#_2\), \(\textsf{label}(au) = \#_1\), or \(\textsf{label}(au) = \#_2\). In each of the three cases, there are some sub-cases for the labels of node aub and node ub. By inspection, we obtain all the possible cases that need to be considered, as shown in Fig. 2.

When \(\textbf{B}= 00\), since \(\textsf{MAW}(\mathcal {S}_{00}) = \varSigma ^* \setminus (\textsf{MAW}(S_1) \cup \textsf{MAW}(S_2))\), there are no MAWs to output. In what follows, we describe our solutions to the cases with \(\textbf{B}\in \{10, 11\}\). We remark that the case with \(\textbf{B}= 01\) is symmetric to the case with \(\textbf{B}= 10\).

Fig. 2.

Left: All possible cases of the labels of the nodes au, aub, and ub, and their corresponding bit vectors \(\textbf{B}\). “absent” refers to the case where there is no out-going edge labeled b from node au. The cells with “-” refer to impossible combinations of node labels. Middle: Illustration for \(\textsf{DAWG}(\mathcal {S})\) which shows the case where au is labeled \(\#_1\#_2\), aub is labeled \(\#_1\), and ub is labeled \(\#_1\#_2\). In this case aub is a MAW in \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) with \(\textbf{B}= 01\) (see the left table). Right: The regions corresponding to the bit vectors \(\textbf{B}\in \{00, 01, 10, 11\}\).

5.1 When \(\textbf{B}= 10\)

There are four cases in which we output aub as a MAW for \(\textsf{MAW}(\mathcal {S}_{10})\) (see the table on the left of Fig. 2; a direct check of these cases is sketched after the list):

  (1) \(\textsf{label}(au) = \#_1\#_2\), \(\textsf{label}(aub) = \#_2\), and \(\textsf{label}(ub) = \#_1\#_2\);
  (2) \(\textsf{label}(au) = \#_1\#_2\), \(aub \in \textsf{AW}(\mathcal {S})\), and \(\textsf{label}(ub) = \#_1\);
  (3) \(\textsf{label}(au) = \#_1\), \(aub \in \textsf{AW}(\mathcal {S})\), and \(\textsf{label}(ub) = \#_1\);
  (4) \(\textsf{label}(au) = \#_1\), \(aub \in \textsf{AW}(\mathcal {S})\), and \(\textsf{label}(ub) = \#_1\#_2\).
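
The sketch below checks these four cases directly at every node of the labeled \(\textsf{DAWG}(\mathcal {S})\). It is a didactic reformulation rather than the charging-based Algorithms 1 and 2 given below, and the node fields edges, slink, label, and shortest() are assumptions about the representation, not the paper's data structure.

```python
def maws_B10(nodes):
    """Compute MAW(S_10) by the case analysis (1)-(4) above.
    Assumed node fields: edges (dict: character -> child node), slink
    (suffix-link target, None at the source), label (one of the sets
    {1}, {2}, {1, 2}), and shortest() (the shortest string of the node)."""
    out = []
    source = next(v for v in nodes if v.slink is None)
    # length-1 MAWs in MAW(S_10): characters occurring in S_2 but not in S_1
    for c, child in source.edges.items():
        if child.label == {2}:
            out.append(c)
    for v in nodes:
        if v.slink is None:                    # skip the source node
            continue
        au, w = v.shortest(), v.slink          # w represents u = au[1:]
        for b, ub_node in w.edges.items():     # every candidate aub has ub in Substr(S)
            lab_au, lab_ub = v.label, ub_node.label
            aub_node = v.edges.get(b)          # None <=> aub is absent from both strings
            if aub_node is None:
                # cases (2)-(4): aub in AW(S)
                if (lab_au == {1, 2} and lab_ub == {1}) or \
                   (lab_au == {1} and lab_ub in ({1}, {1, 2})):
                    out.append(au + b)
            elif lab_au == {1, 2} and aub_node.label == {2} and lab_ub == {1, 2}:
                out.append(au + b)             # case (1): aub occurs in S_2 only
    return out
```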

When \(\textsf{label}(au) = \#_1\#_2\). We create skip links that simultaneously manage Cases (1) and (2), both having \(\textsf{label}(au) = \#_1 \#_2\). We create a selected list \(\textsf{schar}(u)\) of out-going edge labels of node u such that \(\textsf{schar}(u) = \{b \mid \textsf{label}(ub) = \#_1\}\), where the elements are lexicographically sorted. Let \(\textsf{char}(au)\) be the sorted list of all out-going edge labels of node au. For any list L of characters and any character \(c \in \varSigma \), let \(\textsf{succ}(c, L)\) denote the lexicographical successor of c in L. Our algorithm for \(\textbf{B}= 10\) and \(\textsf{label}(au) = \#_1\#_2\) is described in Algorithm 1.

Algorithm 1
Fig. 3.

Illustration for our algorithm for \(\textbf{B}= 10\) and \(\textsf{label}(au) = \#_1 \#_2\). The white cells in the table show the cases where we output elements of \(\textsf{MAW}(\mathcal {S}_{10})\). We compare the labels of the selected out-going edges of node au and u which are connected by the skip links, in sorted order. In this diagram, aud and aui are output in line 12 and auc and aue are output in line 12 of Algorithm 1 as elements of \(\textsf{MAW}(\mathcal {S}_{10})\).

When \(\textsf{label}(au) = \#_1\). We create skip links that simultaneously manage Cases (3) and (4), both having \(\textsf{label}(au) = \#_1\). We create another selected list \(\textsf{schar}'(u)\) of out-going edge labels of node u such that \(\textsf{schar}'(u) = \{b \mid \textsf{label}(ub) \in \{\#_1, \#_1\#_2\}\}\), where the elements are lexicographically sorted. We use the same \(\textsf{char}(au)\) as in the previous case. Our algorithm for \(\textbf{B}= 10\) and \(\textsf{label}(au) = \#_1\) is described in Algorithm 2.

Algorithm 2

Lemma 1

(Linear-time MAW computation for \(\textbf{B}= 10\)). Given \(\textbf{B}= 10\), one can compute \(\textsf{MAW}(\mathcal {S}_{10})\) in \(O(n+|\textsf{MAW}(\mathcal {S}_{10})|)\) time and O(n) working space for integer alphabets of polynomial size in \(n = \Vert \mathcal {S} \Vert \).

Proof

We run Algorithm 1 and Algorithm 2 for every node au of \(\textsf{DAWG}(\mathcal {S})\).

In the preprocessing phase, we build the edge-sorted \(\textsf{DAWG}(\mathcal {S})\) in \(O(\Vert \mathcal {S} \Vert )\) time and space by Theorem 1. Since the out-going edges of every node are sorted, we can easily compute the sorted lists \(\textsf{char}(au)\), \(\textsf{schar}(u)\), \(\textsf{schar}'(u)\), and \(\textsf{schar}''(u)\) for all nodes in O(n) total time.

Let us consider the complexity of the scanning phase of Algorithm 1. Each edge-label comparison that falls into “\(\hat{b} = b\)” in line 6 of Algorithm 1 is charged either to the reported MAW aub if \(\textsf{label}(aub) = \#_2\) and \(\textsf{label}(ub) = \#_1 \#_2\) (in line 7 and line 8), or to the out-going edge of node au labeled b otherwise. Each edge-label comparison that falls into “\(\hat{b} \succ b\)” in line 11 is charged to the reported MAW aub in line 12. This ensures the desired time complexity of Algorithm 1. The complexity of Algorithm 2 can be shown similarly.

The correctness of Algorithm 1 and Algorithm 2 is immediate from the tables in Figs. 2 and 3.    \(\square \)

5.2 When \(\textbf{B}= 11\)

There is a single case in which we output aub as a MAW for \(\textsf{MAW}(\mathcal {S}_{11})\) (see Fig. 2): \(\textsf{label}(au) = \#_1\#_2\), \(aub \in \textsf{AW}(\mathcal {S})\), and \(\textsf{label}(ub) = \#_1\#_2\).

Unwanted comparisons can occur here if \(aub \in \textsf{AW}(\mathcal {S})\), and \(\textsf{label}(ub) = \#_1\) or \(\textsf{label}(ub) = \#_2\). To avoid such comparisons, we consider another carefully selected list \(\textsf{schar}''(u)\) of out-going edge labels of node u such that \(\textsf{schar}''(u) = \{b \mid \textsf{label}(ub) = \#_1\#_2\}\), where the elements are lexicographically sorted. We can use the same \(\textsf{char}(au)\) as in the previous subsection.
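
In summary, the three skip lists differ only in which children ub of node u they keep. A minimal sketch of the three selection predicates, over the set-valued labels used in the sketch of Sect. 5.1 (a hypothetical encoding):

```python
# one selection predicate per skip list, applied to the label of the child node ub
def in_schar(lab_ub):      return lab_ub == {1}             # B = 10, label(au) = #1#2
def in_schar_pr(lab_ub):   return lab_ub in ({1}, {1, 2})   # B = 10, label(au) = #1    (schar')
def in_schar_dpr(lab_ub):  return lab_ub == {1, 2}          # B = 11                    (schar'')
```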

We can modify Algorithm 2 (for \(\textbf{B}= 10\) and \(\textsf{label}(au) = \#_1\)) so that the modified algorithm computes MAWs for \(\textbf{B}= 11\), simply by using \(\textsf{schar}''(u)\) in place of \(\textsf{schar}'(u)\). This leads us to the following lemma:

Lemma 2

(Linear-time MAW computation for \(\textbf{B}= 11\)). Given \(\textbf{B}= 11\), one can compute \(\textsf{MAW}(\mathcal {S}_{11})\) in \(O(n+|\textsf{MAW}(\mathcal {S}_{11})|)\) time and O(n) working space for integer alphabets of polynomial size in \(n = \Vert \mathcal {S} \Vert \).

5.3 Our Main Result for \(k = 2\)

Finally, we obtain our main result for the case of two strings (\(k = 2\)).

Theorem 2

(Linear-time MAW computation for a set of two strings). Given a set \(\mathcal {S} = \{S_1, S_2\}\) of two strings of total length n and a bit vector \(\textbf{B}\in \{01, 10, 11\}\), one can compute \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) in \(O(n+|\textsf{MAW}(\mathcal {S}_{\textbf{B}})|)\) time and O(n) working space for integer alphabets of polynomial size in n.

The following corollary is immediate from Theorem 2.

Corollary 1

Given a set \(\mathcal {S} = \{S_1, S_2\}\) of two strings of total length n, one can compute \(\textsf{MAW}(S_1) \cap \textsf{MAW}(S_2)\), \(\textsf{MAW}(S_1) \cup \textsf{MAW}(S_2)\), and \(\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)\) in \(O(n+|\textsf{MAW}(S_1) \cap \textsf{MAW}(S_2)|)\) time, \(O(n+|\textsf{MAW}(S_1) \cup \textsf{MAW}(S_2)|)\) time, and \(O(n+|\textsf{MAW}(S_1) \bigtriangleup \textsf{MAW}(S_2)|)\) time, respectively, using O(n) working space, for integer alphabets of polynomial size in n.
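
On the running example of Sect. 2, the three sets of Corollary 1 can be cross-checked brute force by combining the generalized MAW sets (reusing maw_B() from the earlier sketch):

```python
S1, S2, sigma = "abaab", "aacbba", "abcd"
sym_diff     = maw_B([S1, S2], "10", sigma) | maw_B([S1, S2], "01", sigma)
intersection = maw_B([S1, S2], "11", sigma)
union        = sym_diff | intersection
# sym_diff == {"aaba", "bab", "bb", "c", "ab", "baa", "bac", "bbb", "bc", "ca", "cba", "cc"}
# intersection == {"aaa", "d"}
```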

6 Algorithm for Arbitrary \(k > 2\)

In this section, we present our algorithm for computing \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) in the case where \(\mathcal {S} = \{S_1, \ldots , S_k\}\) contains \(k > 2\) strings.

Let \(\textbf{B}\in \{0,1\}^k \setminus \{0^k\}\) be an input bit vector of length \(k > 2\). We redefine the labels of the nodes of \(\textsf{DAWG}(\mathcal {S})\) such that \(\textsf{label}(v)[i] = 1\) iff v represents a substring of \(S_i\), for \(1 \le i \le k\). Namely, \(\textsf{label}(v)\) is now also a bit vector of length k.

Let \(aub \in \varSigma ^*\) (\(a,b \in \varSigma \) and \(u \in \varSigma ^*\)) be a candidate of an element of \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) as in the previous sections, where the suffix link of node au points to node u and node u has an out-going edge labeled b. Then, it follows from the definition of \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) that \(aub \in \textsf{MAW}(\mathcal {S}_{\textbf{B}})\) iff

(A): \(\textsf{label}(aub)[i] = 0\), \(\textsf{label}(au)[i] = 1\), and \(\textsf{label}(ub)[i] = 1\) (i.e. \(aub \in \textsf{MAW}(S_i)\)), or

(A’): au has no out-going edge labeled b, \(\textsf{label}(au)[i] = 1\), and \(\textsf{label}(ub)[i] = 1\) (i.e. \(aub \in \textsf{MAW}(S_i)\))

for all \(1 \le i \le k\) with \(\textbf{B}[i] = 1\), and

(B): \(\textsf{label}(aub)[i] = 1\) (i.e. \(aub \in \textsf{Substr}(S_i)\)), or

(C): \(\textsf{label}(aub)[i] = 0\), and \(\textsf{label}(au)[i] = 0\) or \(\textsf{label}(ub)[i] = 0\) (i.e. \(aub \in \textsf{nonMAW}(S_i)\)), or

(C’): au has no out-going edge labeled b, and \(\textsf{label}(au)[i] = 0\) or \(\textsf{label}(ub)[i] = 0\) (i.e. \(aub \in \textsf{nonMAW}(S_i)\))

for all \(1 \le i \le k\) with \(\textbf{B}[i] = 0\).

For each node au in \(\textsf{DAWG}(\mathcal {S})\) whose suffix link points to node u, we create a single united skip link list \(\textsf{schar}(u)\) over the children ub of node u such that \(b \in \textsf{schar}(u)\) iff \(\textsf{label}(ub)[i] = 1\) for every i with \(\textbf{B}[i] = 1\).

After the above preprocessing is finished, we proceed to the scanning phase of our algorithm. For each node au, we scan the lists \(\textsf{char}(au)\) and \(\textsf{schar}(u)\) in parallel, analogously to the case with \(k = 2\). Let \(\hat{b} \in \textsf{char}(au)\) and \(b \in \textsf{schar}(u)\). Our algorithm compares these characters in sorted order while keeping the invariant \(\hat{b} \succeq b\), as in the case with \(k = 2\).

When the comparison falls into the case “\(\hat{b} = b\)”, we output aub as an element of \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) if Case (A) is satisfied for every i with \(\textbf{B}[i] = 1\) and Case (B) or Case (C) is satisfied for every i with \(\textbf{B}[i] = 0\). When the comparison falls into the case “\(\hat{b} \succ b\)”, we output aub as an element of \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) if Case (A’) is satisfied for every i with \(\textbf{B}[i] = 1\) and Case (C’) is satisfied for every i with \(\textbf{B}[i] = 0\).

This already gives us an O(nk)-time algorithm for computing \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) using \(O(n(k+\log n))\) bits of working space, or alternatively \(O(n \lceil k / \log n \rceil )\) words of working space in the word RAM model with machine word size \(\omega = \log n\).

We can speed up the checking of Cases (A), (B), and (C) for each node au from O(k) time to \(O(\lceil k / \log n \rceil )\) time by packing the k-bit labels stored at the nodes aub, au, and ub into bit masks of \(\omega = \log n\) bits each. For Cases (A’) and (C’), it suffices to use only the bit masks stored at nodes au and ub, since node aub does not exist in these cases, which we detect as a result of a “\(\hat{b} \succ b\)” comparison.
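
A minimal sketch of this bit-parallel test, with the k-bit labels packed into integers (plain Python integers stand in for the \(\omega \)-bit machine words; the function name and the bit ordering, bit \(i-1\) for \(S_i\), are ours):

```python
def candidate_in_MAW_B(lab_au, lab_ub, lab_aub, B, k):
    """Check conditions (A)/(A') and (B)/(C)/(C') with O(ceil(k / wordsize))
    word operations.  lab_* and B are k-bit integers; lab_aub == 0 encodes
    the case where au has no out-going edge labeled b (aub is absent)."""
    full = (1 << k) - 1
    A, U, X = lab_au, lab_ub, lab_aub
    ones_ok  = ((A & U & ~X) & B) == B                                 # (A) / (A')
    zeros_ok = ((X | (full & ~(A & U))) & (full & ~B)) == (full & ~B)  # (B), (C) / (C')
    return ones_ok and zeros_ok

# e.g. the candidate aaa for S = {abaab, aacbba} of Sect. 2: au = ub = aa occurs in
# both strings, aaa is absent, and with B = 11:
#   candidate_in_MAW_B(0b11, 0b11, 0, 0b11, 2)  ->  True   (aaa is in MAW(S_11))
```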

Theorem 3

(Efficient MAW computation for a set of k strings). Given a set \(\mathcal {S} = \{S_1, \ldots , S_k\}\) of k strings of total length n and a bit vector \(\textbf{B}\in \{0,1\}^k \setminus \{0^k\}\), one can compute \(\textsf{MAW}(\mathcal {S}_{\textbf{B}})\) in \(O(n\lceil k / \log n \rceil +|\textsf{MAW}(\mathcal {S}_{\textbf{B}})|)\) time and \(O(n(k+\log n))\) bits of working space (or alternatively \(O(n\lceil k / \log n \rceil )\) words of working space), for integer alphabets of polynomial size in n.

7 Discussions

Béal et al. [6] considered a different version of MAWs \(\textsf{MAW}'(\mathcal {S})\) for a set \(\mathcal {S}\) of k strings, where a string \(w = aub\) is a MAW for \(\mathcal {S} = \{S_1, \ldots , S_k\}\) if \(aub \notin \textsf{Substr}(\mathcal {S})\), \(au \in \textsf{Substr}(S_i)\) and \(ub \in \textsf{Substr}(S_j)\) for some \(1 \le i, j \le k\). They gave an \(O(\sigma n)\)-time and space solution for computing \(\textsf{MAW}'(\mathcal {S})\). This version of MAWs can be computed in optimal \(O(n+|\textsf{MAW}'(\mathcal {S})|)\) time, independently of k, by running our algorithm without skip links. Ayad et al. [3] considered the problem of computing the same version of MAWs of length up to \(\ell > 1\).
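
A brute-force sketch of this variant, directly from the definition above (the function name is ours; exponential, for tiny inputs only):

```python
def maw_prime(strings, sigma):
    """MAW'(S): aub absent from every string of the set, while au is a
    substring of some S_i and ub is a substring of some S_j (possibly i != j)."""
    subs = [{S[i:j] for i in range(len(S) + 1) for j in range(i, len(S) + 1)}
            for S in strings]
    subs_all = set().union(*subs)
    out = set()
    for u in subs_all:
        for a in sigma:
            for b in sigma:
                w = a + u + b
                if (w not in subs_all and any(a + u in s for s in subs)
                        and any(u + b in s for s in subs)):
                    out.add(w)
    return out
```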

Independently of our work, the recent work by Béal and Crochemore [5] considered the following problem: Let \(\mathcal {T}\) and \(\mathcal {R}\) be sets of strings, where \(\mathcal {T}\) is called a target and \(\mathcal {R}\) is called a reference. A \(\mathcal {T}\)-specific string with respect to \(\mathcal {R}\) is a string u such that \(u \in \textsf{Substr}(\mathcal {T})\), \(u \notin \textsf{Substr}(\mathcal {R})\), and \(v \in \textsf{Substr}(\mathcal {R})\) for every proper substring v of u. By definition, a string u is a \(\mathcal {T}\)-specific string with respect to \(\mathcal {R}\) if and only if \(u \in \textsf{MAW}(\mathcal {R}) \cap \textsf{Substr}(\mathcal {T})\). Béal and Crochemore [5] showed an algorithm for finding all \(\mathcal {T}\)-specific strings w.r.t. \(\mathcal {R}\) in \(O(n \sigma )\) time and O(n) space, where n is the total length of the strings in \(\mathcal {T}\) and \(\mathcal {R}\), assuming that the edges of the DAWG are represented by transition matrices (Proposition 2, [5]). Their algorithm also uses the DAWG built on \(\mathcal {T}\) and \(\mathcal {R}\), and marks its nodes in an appropriate way (Proposition 1, [5]). This marking technique is very similar to our skip links from Sect. 5 for the case of \(k = 2\), and thus our algorithm can be extended to solve this problem in O(n) time and space for integer alphabets.
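
For singleton target and reference sets, this characterization can be checked brute force, reusing maws() from the sketch in Sect. 1 (the helper below is ours):

```python
def t_specific(T0, R0, sigma):
    """T-specific strings of the target T0 w.r.t. the reference R0:
    exactly MAW(R0) intersected with the non-empty substrings of T0."""
    subs_T = {T0[i:j] for i in range(len(T0)) for j in range(i + 1, len(T0) + 1)}
    return maws(R0, sigma) & subs_T
```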