1 Introduction

A palindrome is a string that is equal to its reverse, e.g., “Able_was_I_ere_I_saw_Elba” (for simplicity, we treat upper-case and lower-case characters as the same). Palindromes have been studied in combinatorics on words and stringology.

Much research has focused on finding the palindromic structure of a string. Manacher [12] proposed a beautiful algorithm that enumerates all maximal palindromes of a string. Kosolobov et al. [11] proved that the language \(P^k\) can be recognized in \(O(kN)\) time, where \(P\) is the language of all nonempty palindromes and \(N\) is the length of an input string. Alatabbi et al. [2] considered the maximal palindromic factorization, in which all factors are maximal palindromes. They also considered the problem of computing a palindromic factorization with the fewest factors, and proposed off-line linear-time algorithms. Later, I et al. [9] and Fici et al. [4] independently proposed on-line \(O(N\log N)\)-time algorithms, where \(N\) is the length of an input string. Similar problems were also considered, such as computing the palindromic length [3], computing palindromic covers [9], and palindromic pattern matching [8].

A gapped palindrome is a generalization of a palindrome: it is a string that becomes a palindrome when a center substring is replaced by a single character, where a center substring is a substring whose beginning and ending positions are equally far from the beginning and ending positions of the input string, respectively. For example, “Madam,_he_is_Adam” is a gapped palindrome, and it becomes a palindrome if the center substring “m,_he_is_” is replaced by a character. Gapped palindromes play an important role in molecular biology since they model hairpin structures of DNA and RNA sequences, see e.g. [14]. Several problems have been considered, such as the enumeration of exact gapped palindromes of a string [10], the enumeration of approximate gapped palindromes [7, 13], and finding the maximal length of long-armed and length-constrained gapped palindromes [5].

In this paper, we consider the notion of block palindromes [1], which is a new generalization of palindromes and also of gapped palindromes. A block palindrome is a string that becomes a palindrome when identical substrings are replaced with a distinct character. More precisely, a block palindrome is a “symmetric” factorization \(f=f_{-n} \cdots f_{-1} f_0 f_1 \cdots f_n\) of a string in which the center factor \(f_0\) is a (possibly empty) string and the other factors satisfy \(f_{-i}=f_{i}\), with \(f_{i}\) non-empty, for every \(1 \le i\le n\). We also call a factor a block. For convenience, let \(f=f_0\) when \(n=0\). For example, the factorization “To|kyo|_|and|_|Kyo|to” is a block palindrome, where “|” is a mark that separates adjacent blocks. Palindromes and gapped palindromes are special cases of block palindromes: for a palindrome, all blocks are single characters, and for a gapped palindrome, the center block \(f_0\) is a string and the other blocks are single characters.
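The definition can be checked mechanically. The following is a minimal sketch, assuming a block palindrome is handed over as an explicit list of blocks \(f_{-n}, \ldots , f_0, \ldots , f_n\) with a possibly empty center; the function name `is_block_palindrome` is illustrative and not part of the paper.

```python
def is_block_palindrome(blocks):
    """Check that `blocks` = [f_{-n}, ..., f_0, ..., f_n] is a symmetric factorization.

    The center block f_0 must be given explicitly and may be the empty string.
    Every other block must be non-empty and equal to its mirror block.
    """
    if len(blocks) % 2 == 0:
        return False  # the list must have the shape 2n + 1 (center included)
    n = len(blocks) // 2
    for i in range(1, n + 1):
        left, right = blocks[n - i], blocks[n + i]
        if left == "" or left != right:
            return False
    return True


# The example from the text, with case folded as in the paper's convention.
example = "To|kyo|_|and|_|Kyo|to".lower().split("|")
print(is_block_palindrome(example))  # True
```

An ordinary palindrome passes the check when every block is a single character, and a gapped palindrome passes when only the center block is longer than one character.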

We investigate several properties of block palindromes. We introduce the notion of maximal block palindromes to concisely represent all block palindromes in a string, and propose an algorithm which enumerates all maximal block palindromes in a string \(T\) in \(O(|T| + \Vert MBP (T)\Vert )\) time, where \(\Vert MBP (T)\Vert \) is the output size. This is optimal unless all the maximal block palindromes can be represented in a more compact way.

2 Preliminaries

Let \(\varSigma \) be an integer alphabet. An element of \(\varSigma ^*\) is called a string. The string of length 0 is called the empty string, and is denoted by \(\varepsilon \). Although \(\varepsilon \) is not contained in \(\varSigma \), we sometimes call \(\varepsilon \) the empty character for convenience. For a string \(T=xyz\), \(x\), \(y\) and \(z\) are called a prefix, substring, and suffix of \(T\), respectively. In particular, a prefix (resp. suffix) \(x\) of \(T\) is called a proper prefix (resp. suffix) iff \(x\ne T\). A non-empty string that is a proper prefix and also a proper suffix of \(T\) is called a border of \(T\). Hence, a string of length \(N\) can have at most \(N-1\) borders of length ranging from 1 to \(N-1\). A string which does not have any borders is called an unbordered string. For \(1 \le i\le j\le |T|\), a substring of \(T\) which begins at position \(i\) and ends at position \(j\) is denoted by \(T[i\ldots j]\). For convenience, let \(T[i\ldots j] = \varepsilon \) if \(j< i\).
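As a small illustration of borders, the sketch below finds the shortest border of a string by direct comparison. It is a naive quadratic-time check meant only to make the definition concrete; the helper names are illustrative.

```python
def shortest_border(s):
    """Return the shortest non-empty proper prefix of s that is also a suffix.

    Returns None when s is unbordered (including the empty string and
    single characters).
    """
    for k in range(1, len(s)):
        if s[:k] == s[len(s) - k:]:
            return s[:k]
    return None


def is_unbordered(s):
    """True if s has no border."""
    return shortest_border(s) is None


print(shortest_border("abaab"))  # ab
print(is_unbordered("abc"))      # True
```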

In this paper, we also consider half-positions \(k+1/2\) for integers \(0 \le k\le |T|\). For convenience, for a half-position \(i\) and an integer \(r\) such that \(1/2 \le i-r\le i+ r\le |T|+1/2\), let \(T[i-r\ldots i+r] = T[\lceil i-r \rceil \ldots \lfloor i+r \rfloor ]\). Note that \(T[i]\) for a half-position \(i\) is the empty character. The position \(c=(|T|+1)/2\) is called the center position of \(T\), \(T[c]\) is called the center character of \(T\), and \(T[c-d\ldots c+d]\) for an integer \(d\) is called a center substring of \(T\).
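The rounding rule for half-positions can be made concrete as follows. This is a sketch under the assumption that positions are 1-indexed and that a half-position \(k+1/2\) is represented as a `Fraction`; the function name `substring_around` is hypothetical.

```python
import math
from fractions import Fraction


def substring_around(t, center, radius):
    """Return T[center - radius .. center + radius] using ceil/floor rounding.

    `center` is an integer position or a half-position such as
    Fraction(5, 2) for 2 + 1/2; `radius` is a non-negative integer.
    """
    lo = math.ceil(center - radius)   # first integer position in the range
    hi = math.floor(center + radius)  # last integer position in the range
    if lo < 1 or hi > len(t):
        raise ValueError("range leaves the string")
    return t[lo - 1:hi] if lo <= hi else ""


print(substring_around("abcde", 3, 1))               # bcd
print(substring_around("abcd", Fraction(5, 2), 1))   # bc
print(substring_around("abcd", Fraction(5, 2), 0))   # (empty string)
```

With radius 0 a half-position yields the empty string, which matches the convention that \(T[i]\) is the empty character for a half-position \(i\).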

For a string \(T\) and integers \(1 \le i, j\le |T|\), a longest common extension (LCE) query \( LCE _T(i, j)\) asks the length of the longest common prefix of the two suffixes \(T[i\ldots |T|]\) and \(T[j\ldots |T|]\) of \(T\). When clear from the context, \( LCE _T(i, j)\) is abbreviated as \( LCE (i, j)\). It is well known that if \(T\) is drawn from an integer alphabet of size polynomially bounded in \(|T|\), then LCE queries for \(T\) can be answered in constant time after an \(O(|T|)\)-time preprocessing, e.g., by constructing the suffix tree of \(T\) and a data structure for lowest common ancestor queries on the tree [6].
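For illustration only, an LCE query can be answered by direct character comparison. The sketch below is a naive stand-in whose running time is proportional to the answer, not the constant-time suffix-tree structure cited above. The second helper shows how a border test reduces to a single LCE query; both names are illustrative.

```python
def lce(t, i, j):
    """Naive LCE_T(i, j) for 1-indexed positions i and j."""
    n = len(t)
    k = 0
    while i + k <= n and j + k <= n and t[i + k - 1] == t[j + k - 1]:
        k += 1
    return k


def has_border_of_length(t, b, e, k):
    """True if T[b..b+k-1] is a border of T[b..e], tested as LCE(b, e-k+1) >= k."""
    return 1 <= k < e - b + 1 and lce(t, b, e - k + 1) >= k


print(lce("abcabcab", 1, 4))                  # 5
print(has_border_of_length("abaab", 1, 5, 2)) # True
```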

For a block palindrome \(f=f_{-n} \cdots f_{-1} f_0 f_1 \cdots f_n\), the length of \(f\) denoted by \(|f|\) is the total length of blocks, and the size of \(f\) denoted by \(\Vert f\Vert \) is the number of non-empty blocks. A block palindrome is even if its size is even (that is, the center block \(f_0\) is the empty string), and otherwise odd (that is, the center block \(f_0\) is non-empty).

3 Properties of Block Palindromes

In this section, we investigate the properties of block palindromes. We assume that \(T\) is an input string of length \(N\) in the rest of the paper.

Since there are \(O(2^N)\) factorizations of \(T\) and block palindromes are symmetric, there are \(O(2^{N/2})\) block palindromes of \(T\). Moreover, this bound is tight: it is attained when \(T\) consists of a single repeated character.
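The tightness claim can be checked by brute force on small inputs. The sketch below counts block palindromes recursively: a string is either taken whole as the center block, or a matching prefix and suffix of equal length are peeled off and the remainder is counted in the same way. For \(T=\mathtt {a}^N\) this count works out to exactly \(2^{\lfloor N/2 \rfloor }\). The function name is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_block_palindromes(s):
    """Number of block palindromes (symmetric factorizations) of s."""
    total = 1  # s itself as the single center block (n = 0)
    for k in range(1, len(s) // 2 + 1):
        if s[:k] == s[len(s) - k:]:
            # peel the outermost pair of blocks and count the inner part
            total += count_block_palindromes(s[k:len(s) - k])
    return total


for n in range(1, 9):
    print(n, count_block_palindromes("a" * n), 2 ** (n // 2))
```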

Although there are a huge number of block palindromes of a string, they are very redundant. To look for more essential properties of block palindromes, we define the largest block palindrome which is a representative of other block palindromes. A block palindrome \(f=f_{-n} \cdots f_{n}\) of \(T\) that has the largest number of blocks among all block palindromes of \(T\) is called the largest block palindrome. Note that each block \(f_i\) for \(0 \le i\le n\) is an unbordered substring and \(f_i\) for \(0 < i\le n\) is the shortest border of \(T[k+ 1 \ldots N- k]\), where \(k=0\) if \(i=n\) and \(k=|f_{i+1} \cdots f_n|\) otherwise. So, the largest block palindrome of \(T\) is unique. The largest block palindrome is a representative of all block palindromes in the sense that all block palindromes can be represented as block palindromes of \(f\).

A natural question is how to efficiently compute the largest block palindrome of \(T\). The following theorem answers this question.

Theorem 1

The largest block palindrome \(f_{-n} \cdots f_{n}\) of \(T\) can be computed in \(O(N)\) time.

Proof

We construct a data structure in \(O(N)\) time that can answer any LCE query in constant time.

We greedily compute the blocks from the outermost block \(f_{n}\) to the innermost block \(f_{1}\) by LCE queries. Suppose that we are computing the shortest border \(f_{i}\) of \(T[b\ldots e]\). For \(k=1\) to \(\lfloor (e- b+ 1)/2 \rfloor \), we check whether \(T[b\ldots b+ k- 1]\) is a border of \(T[b\ldots e]\) by checking whether \( LCE (b, e-k+1) \ge k\). If \(T[b\ldots e]\) does not have any border, we obtain \(f_0 = T[b\ldots e]\). Otherwise, we obtain the shortest border \(f_i=T[b\ldots b+ k-1]\) of \(T[b\ldots e]\), and continue with the inner blocks of \(T[b+ k\ldots e- k]\). Since the number of LCE queries is \(O(N)\) and each LCE query takes constant time, the largest block palindrome of \(T\) can be computed in \(O(N)\) time.    \(\square \)
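The greedy procedure in the proof can be sketched as follows. This is an illustrative version that replaces the constant-time LCE query by a direct substring comparison, so it does not achieve the \(O(N)\) bound of Theorem 1; the function name is hypothetical.

```python
def largest_block_palindrome(t):
    """Return the blocks f_{-n}, ..., f_0, ..., f_n of the largest block palindrome of t."""
    b, e = 0, len(t)   # current inner substring is t[b:e] (0-indexed, half-open)
    outer = []         # f_n, f_{n-1}, ..., f_1 in the order they are found
    while True:
        found = 0
        for k in range(1, (e - b) // 2 + 1):
            # stands in for the test LCE(b, e - k + 1) >= k
            if t[b:b + k] == t[e - k:e]:
                found = k
                break
        if found == 0:
            center = t[b:e]  # no border left: this is f_0 (possibly empty)
            break
        outer.append(t[b:b + found])
        b += found
        e -= found
    return outer + [center] + outer[::-1]


print(largest_block_palindrome("tokyo_and_kyoto"))
# ['to', 'kyo', '_', 'and', '_', 'kyo', 'to']
```

Taking the first matching \(k\) yields the shortest border, which is why every block produced is unbordered.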

So far, we have considered only block palindromes that are equal to \(T\) itself. Next, we consider block palindromes that appear as substrings in \(T\). We define a maximal block palindrome which is a representative of some block palindromes in \(T\), and study how many maximal block palindromes can appear in \(T\).

For a half-position \(1 \le c\le N\) and an integer \(1 \le d\le N/2\), let \(F_T(c, d)=\{f \mid f=f_{-n} \cdots f_0 \cdots f_{n} \text { is a largest block palindrome},\ f_0=T[c- d+ 1 \ldots c+ d-1],\ f=T[c- d- k+1 \ldots c+ d+ k-1],\ k= |f_1 \cdots f_n| \}\) be the set of largest block palindromes whose center positions are the same and whose center blocks appear at \(T[c-d+1 \ldots c+d-1]\). When the context is clear, we denote \(F_{T}\) by \(F\). For a string \(T\), a largest block palindrome \(f\in F(c, d)\) such that \(|f|\) is the longest, namely whose number of blocks is maximal among all largest block palindromes of \(F(c, d)\), is called a maximal block palindrome.

We remark that the maximal block palindrome of \(F(c, d)\) is a representative of all the largest block palindromes of \(F(c, d)\).

Remark 1

For a half-position \(1 \le c\le N\) and an integer \(1 \le d\le N/2\), any largest block palindrome \(f=f_{-n} \cdots f_{n} \in F(c, d)\) is a sub-factorization of the maximal block palindrome \(g= g_{-m} \cdots g_{m} \in F(c, d)\), that is, \(n\le m\) and \(f_{i}=g_{i}\) for \(0 \le i\le n\).

Proof

Assume for contradiction that the statement does not hold. Since \(g\) is the longest block palindrome in \(F(c, d)\) and \(f_0 = g_0\), there is a smallest index \(j\) with \(1 \le j\le n\) such that \(f_{j} \ne g_{j}\), that is, \(f_{i}=g_{i}\) for \(0 \le i< j\). If \(|f_{j}| < |g_{j}|\), then \(f_{j}\) is a border of \(g_{j}\), which contradicts that \(g\) is a largest block palindrome, since every block of a largest block palindrome is unbordered. The same argument applies to the case \(|f_{j}| > |g_{j}|\). Therefore, such \(f_j\) and \(g_j\) do not exist and the statement holds.    \(\square \)

We are interested in how many maximal block palindromes can appear in \(T\). The number is trivially upper bounded by \(O(N^2)\) since there are \(O(N^2)\) substrings that can be center substrings. If there is no limitation on the size of maximal block palindromes, we can easily see that this bound is tight. For a string \(T\) of length \(N\) in which the characters are all distinct, any substring \(w\) is unbordered, and there is at least one maximal block palindrome that contains \(w\) as its center block. Thus, \(T\) can contain \(\varTheta (N^2)\) maximal block palindromes. The following example shows that the number of maximal block palindromes having at least three blocks has the same tight upper bound.

Example 1

The number of maximal block palindromes in \(T=\mathtt {a}^n\mathtt {b}^n\mathtt {a}\mathtt {b}\mathtt {a}^n\mathtt {b}^n\) that have at least three blocks is \(\varTheta (N^2)\), where \(c^n\) for a character \(c\) denotes a run of \(c\) of length \(n\), and \(n=(N-2)/4\).

For convenience, we write \(T=A_0 B_1 A_1 B_2 A_2 B_3\), where \(A_0\), \(B_1\), \(A_1\), \(B_2\), \(A_2\), and \(B_3\) are the strings \(\mathtt {a}^n\), \(\mathtt {b}^n\), \(\mathtt {a}\), \(\mathtt {b}\), \(\mathtt {a}^n\), and \(\mathtt {b}^n\), respectively. For every \(1<i\le n\) and \(1<j< n\), there is a maximal block palindrome of size three, \(T[n-j+1 \ldots N-n+i-1] = (A_0[n-j+1 \ldots n] B_1[1 \ldots i-1])(B_1[i\ldots n]A_1 B_2 A_2[1 \ldots n-j])(A_2[n-j+1 \ldots n]B_3[1 \ldots i-1])\), where the parentheses indicate blocks and each block is unbordered.

Note that the upper bound is reduced to \(O(N)\) if we impose a limitation on the lengths of center blocks.

Remark 2

For any constant \(k\ge 0\), a string of length \(N\) can contain \(\varTheta (N)\) maximal block palindromes whose center blocks are of length at most \(k\), because there are \(O(N)\) possible center blocks. In particular, a string contains at most \(N- 1\) maximal block palindromes of even size (i.e., whose center blocks are empty) because the number of occurrences of such center blocks is at most \(N-1\).

The following lemma shows an interesting property of maximal block palindromes, and this property can be used for the proof of Lemma 2.

Lemma 1

For a half-position \(1 \le c\le N\) and two integers \(1 \le d<d^\prime \le N/2\), two largest block palindromes \(f=f_{-n} \cdots f_{n} \in F(c, d)\) and \(g= g_{-m} \cdots g_{m} \in F(c, d^\prime )\) do not share block boundaries; namely, the ending positions \(k_i=c+ d-1 + |f_1 \cdots f_i|\) and \(k^\prime _j=c+ d^\prime -1 + |g_1 \cdots g_j|\) of their blocks are not equal for any \(1 \le i\le n\) and \(1 \le j\le m\).

Proof

As in the proof of Remark 1, if we assume that this lemma does not hold, then a block of \(f\) or \(g\) must have a border, which contradicts that \(f\) and \(g\) are largest block palindromes.    \(\square \)

Let \(\Vert MBP (T)\Vert \) denote the sum of the sizes of all maximal block palindromes in \(T\).

Lemma 2

For any string \(T\) of length \(N\), \(\Vert MBP (T)\Vert \le N(2N-1)\).

Proof

By Lemma 1, any two largest block palindromes whose center positions are the same but whose center blocks are different do not share block boundaries. This implies that, for a half-position \(c\), the total number of blocks of maximal block palindromes whose center position is \(c\) is at most \(N\). Since there are \(2N-1\) center positions, we have \(\Vert MBP (T)\Vert \le N(2N-1)\).    \(\square \)

4 Enumeration of Maximal Block Palindromes

In this section, we consider how to enumerate all the maximal block palindromes \( MBP (T)\). A brute-force approach based on Theorem 1 would compute the largest block palindrome for every possible substring \(T[b\ldots b+ \ell - 1]\) (while suppressing the output of non-maximal ones), which takes \(\varTheta (\sum _{\ell = 1}^{N} \ell (N- \ell )) = \varTheta (N^3)\) time.

We propose an optimal solution running in \(o(N^3)\) time.

Theorem 2

All maximal block palindromes that appear in \(T\) can be enumerated in \(O(N+ \Vert MBP (T)\Vert )\) time, where \(\Vert MBP (T)\Vert \) is the output size.

We actually consider a variant of the problem: We propose an algorithm to enumerate all the maximal block palindromes of size \({\ge }2\), whose total output size is denoted by \(\Vert MBP _{\ge 2}(T)\Vert \), in optimal \(O(N+ \Vert MBP _{\ge 2}(T)\Vert )\) time. That is to say, we can completely ignore maximal block palindromes of size 1, which might not be interesting if we focus on palindromic structures in \(T\). If we want to enumerate \( MBP (T)\), we can do that by slightly modifying the algorithm.

Our algorithm proceeds in two steps: (i) enumerate all the pairing unbordered blocks for all center positions in a batch processing, and (ii) build maximal block palindromes from the enumerated blocks.

In Step (i), we first enumerate every pair of occurrences of an unbordered substring in \(T\). Note that each such pair is a component of a maximal block palindrome, and the total number of enumerated pairs is \(O(\Vert MBP _{\ge 2}(T)\Vert )\). We preprocess \(T\) in \(O(N)\) time and space to support LCE queries in constant time. We also compute, for every character in \(T\), the list storing all the occurrences of the character in increasing order; all of these lists can be obtained by sorting the positions \(i\) of \(T\) with the key \(T[i]\) by radix sort in \(O(N)\) time and space.

Now we focus on an occurrence \(b\) of \(T[b]\), and identify every pair of occurrences of an unbordered substring such that the left one starts at \(b\). Let \(b< b_1< b_2< \cdots < b_k\) be the occurrences of \(T[b]\) in \(T[b\ldots N]\). We process \(b_i \in \{b_1, \ldots , b_k\}\) in increasing order to identify common unbordered substrings starting at \(b\) and \(b_i\) using \( LCE \) queries. At the first round for \(b_1\), we see that for any \(\ell \) with \(1 \le \ell \le \min ( LCE (b, b_1), b_1 - b)\), the common substring of length \(\ell \) starting at \(b\) and \(b_1\) is unbordered, and thus, we report each pair of such unbordered substrings. While processing \(b_i \in \{b_1, \ldots , b_k\}\) in increasing order, we maintain a set \(L\) of positive integers \(\ell \) (by a sorted list of intervals) such that \(T[b\ldots b+ \ell - 1]\) has a border caused by the common substrings starting at \(b\) and \(b_i\)’s processed so far. We use \(L\) to efficiently skip \(\ell \)’s such that \(T[b\ldots b+ \ell - 1]\) has a border in the later rounds. For example, in the first round, we add the interval \([b_1 - b+ 1 \ldots b_1 - b+ LCE (b, b_1)]\) to \(L\) (which is initially empty) as, for any \(\ell \in [b_1 - b+ 1 \ldots b_1 - b+ LCE (b, b_1)]\), \(T[b\ldots b+ \ell - 1]\) has a border caused by the common substring starting at \(b\) and \(b_1\). When processing \(b_i\) for \(1 < i \le k\), we see that for any \(\ell \in [1 \ldots \min ( LCE (b, b_i), b_i - b)] \setminus L\), the common substring of length \(\ell \) starting at \(b\) and \(b_i\) is unbordered. Updating \(L\) can be easily done in O(1) time by adding (merging if necessary) the interval \([b_i - b+ 1 \ldots b_i - b+ LCE (b, b_i)]\) to \(L\) (observe that the new interval is always pushed back to \(L\) or merged with the last interval of \(L\) as we process \(\{b_1, \ldots , b_k\}\) in increasing order). 
Note that \([1 \ldots \min ( LCE (b, b_i), b_i - b)] \setminus L\) always contains 1, and we can incrementally enumerate its elements in constant time per element because \(L\) is maintained as a sorted list of intervals. Thus, the computation cost can be charged to the output size, i.e., Step (i) runs in \(O(N+ \Vert MBP _{\ge 2}(T)\Vert )\) time in total.
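The procedure of Step (i) can be sketched as follows. This is an illustrative version under two assumptions that differ from the paper: LCE queries are answered naively by character comparison, and occurrence lists are built with a dictionary in place of radix sort, so the stated time bound is not achieved. The list `bordered` plays the role of \(L\), and the reported positions are 1-indexed. The function names are hypothetical.

```python
from collections import defaultdict


def lce(t, i, j):
    """Naive longest common extension of t[i:] and t[j:] (0-indexed)."""
    k = 0
    while i + k < len(t) and j + k < len(t) and t[i + k] == t[j + k]:
        k += 1
    return k


def unbordered_pairs(t):
    """Return (b_l, b_r, length), 1-indexed, for every pair of occurrences
    of an unbordered substring of t."""
    occurrences = defaultdict(list)
    for pos, ch in enumerate(t):
        occurrences[ch].append(pos)  # positions are appended in increasing order
    pairs = []
    for positions in occurrences.values():
        for idx, b in enumerate(positions):
            bordered = []  # the set L: sorted, disjoint intervals [lo, hi]
            for b_i in positions[idx + 1:]:
                gap = b_i - b
                common = lce(t, b, b_i)
                upper = min(common, gap)
                # report every length in [1 .. upper] that is not in L
                length = 1
                for lo, hi in bordered:
                    if lo > upper:
                        break
                    while length < lo:
                        pairs.append((b + 1, b_i + 1, length))
                        length += 1
                    length = hi + 1
                while length <= upper:
                    pairs.append((b + 1, b_i + 1, length))
                    length += 1
                # lengths in [gap + 1 .. gap + common] now have a border
                lo, hi = gap + 1, gap + common
                if bordered and lo <= bordered[-1][1] + 1:
                    bordered[-1][1] = max(bordered[-1][1], hi)
                else:
                    bordered.append([lo, hi])
    return pairs


print(sorted(unbordered_pairs("abaab")))
# [(1, 3, 1), (1, 4, 1), (1, 4, 2), (2, 5, 1), (3, 4, 1)]
```

Because the interval starts \(b_i - b + 1\) increase with \(i\), a new interval can only overlap or touch the last interval of the list, which is why a single comparison with the last entry suffices for merging.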

When we find a pair of occurrences \(b_{l} < b_{r}\) of an unbordered substring of length \(\ell \), we record it as a triple \((c, b_{r}, b_{r} + \ell )\), where \(c= (b_{l} + b_{r} + \ell - 1) / 2\) is the center of the pairing blocks. After recording all those triples, we sort them using the first and second elements as keys by radix sort, which can be done in \(O(N+ \Vert MBP _{\ge 2}(T)\Vert )\) time and space.
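The conversion from pairs to sorted triples is short enough to sketch directly. To avoid fractional centers, this illustration stores \(2c = b_{l} + b_{r} + \ell - 1\) as the first element, and it uses Python's built-in comparison sort as a stand-in for radix sort; both choices are assumptions of the sketch, and the function name is hypothetical.

```python
def to_sorted_triples(pairs):
    """Convert (b_l, b_r, length) pairs (1-indexed) into triples
    (2c, b_r, b_r + length), sorted by center and then by b_r."""
    triples = [
        (b_l + b_r + length - 1, b_r, b_r + length)
        for b_l, b_r, length in pairs
    ]
    triples.sort(key=lambda triple: (triple[0], triple[1]))
    return triples


# Pairing blocks "to", "kyo", and "_" of "tokyo_and_kyoto" share the center c = 8.
print(to_sorted_triples([(1, 14, 2), (3, 11, 3), (6, 10, 1)]))
# [(16, 10, 11), (16, 11, 14), (16, 14, 16)]
```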

Now we are ready to proceed to Step (ii), in which we build the maximal block palindromes from the sorted list of triples computed in Step (i). To build the maximal block palindromes with center \(c\), we scan the sublist of triples having center \(c\) and connect the pairing blocks whose beginning and ending positions are adjacent, using the second element (the beginning position of the block) and the third element (the ending position of the block plus one) of the triples. We build all the \(c\)-centered maximal block palindromes by extending their blocks outwards simultaneously with a 0-initialized array A of length \(N\). When we look at a triple \((c, b_{r}, b_{r} + \ell )\), we write \(b_{r}\) to \(A[b_{r} + \ell ]\), and connect the block with the block ending at \(b_{r} - 1\) if such a block exists (which is detected by \(A[b_{r}] \ne 0\)). Since block boundaries are not shared due to Lemma 1, the information written in A can be propagated correctly to extend the blocks. This runs in time linear in the size of the sublist. We can also clear A within the same time bound by scanning the sublist again while writing 0 to the entries we touched.
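The linking of Step (ii) can be sketched as follows. This illustration makes several assumptions of its own: the triples are produced by a brute-force stand-in for Step (i), the first element of a triple is \(2c\) so that it stays an integer, and the array (called `slot` here) stores a chain index in place of \(b_{r}\), which carries the same adjacency information. All names are hypothetical.

```python
def is_unbordered(w):
    return not any(w[:k] == w[len(w) - k:] for k in range(1, len(w)))


def brute_force_triples(t):
    """Stand-in for Step (i): triples (2c, b_r, b_r + length), 1-indexed, sorted."""
    n = len(t)
    triples = []
    for b_l in range(1, n + 1):
        for b_r in range(b_l + 1, n + 1):
            for length in range(1, min(b_r - b_l, n - b_r + 1) + 1):
                w = t[b_l - 1:b_l - 1 + length]
                if w == t[b_r - 1:b_r - 1 + length] and is_unbordered(w):
                    triples.append((b_l + b_r + length - 1, b_r, b_r + length))
    return sorted(triples)


def build_chains(t, triples):
    """Link adjacent pairing blocks per center; return each maximal block
    palindrome of size >= 2 as a list of blocks."""
    slot = [0] * (len(t) + 2)  # plays the role of the array A
    result = []
    i = 0
    while i < len(triples):
        chains = []  # chains of right-hand blocks for the current center
        j = i
        while j < len(triples) and triples[j][0] == triples[i][0]:
            two_c, b_r, end = triples[j]
            if slot[b_r]:            # a block ends at b_r - 1: extend its chain
                chain_id = slot[b_r]
            else:                    # nothing to attach to: innermost block f_1
                chains.append((two_c, b_r, []))
                chain_id = len(chains)
            chains[chain_id - 1][2].append((b_r, end))
            slot[end] = chain_id
            j += 1
        for k in range(i, j):        # clear only the entries that were touched
            slot[triples[k][2]] = 0
        for two_c, first_b_r, blocks in chains:
            center = t[two_c - first_b_r:first_b_r - 1]
            right = [t[s - 1:e - 1] for s, e in blocks]
            result.append(right[::-1] + [center] + right)
        i = j
    return result


def enumerate_mbp_size_ge2(t):
    return build_chains(t, brute_force_triples(t))


for blocks in enumerate_mbp_size_ge2("aaaa"):
    print("|".join(blocks))
```

Within one center the triples arrive in increasing order of \(b_{r}\), so an inner block is always recorded before the block that extends it.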

Since the initialization cost \(O(N)\) of A is paid once at the very beginning of Step (ii) and the other computation cost can be charged to the output size, the total time complexity is \(O(N+ \Vert MBP _{\ge 2}(T)\Vert )\).

To enumerate \( MBP (T)\), we modify Step (ii). While scanning the sublist for center \(c\), we can identify all the positions \(e\ge c\) such that \(e\) is not the ending position of any pairing block, for which the substring \(T[2 c- e\ldots e]\) is unbordered. If the unbordered substring cannot be extended outwards by blocks (which can also be checked while scanning the sublist), it is a maximal block palindrome of size 1 and is output as part of \( MBP (T)\). The algorithm runs in \(O(N+ \Vert MBP (T)\Vert )\) time in total as the additional cost can be charged to the output size.

5 Conclusions

In this paper, we investigated several properties of block palindromes, which are a generalization of palindromes and gapped palindromes. We also proposed an optimal algorithm to enumerate all maximal block palindromes appearing in a given string. As mentioned in Remark 2, if we impose a limitation on the lengths of center blocks, the upper bound on the number of maximal block palindromes is reduced to \(O(N)\), where \(N\) is the length of an input string. In particular, for maximal block palindromes of even size, the center blocks are restricted to be empty. The situation is similar to considering ordinary palindromes (in which the center blocks are strictly constrained) versus maximal gapped palindromes (in which the restriction on the center blocks is relaxed). It would be interesting to investigate the properties of maximal block palindromes whose center blocks have restricted lengths and to develop efficient algorithms that enumerate only such a subset of maximal block palindromes.