Keywords

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

1 Introduction

A palindrome is a symmetric word that reads the same backward and forward. The detection of palindromes is a classical and well-studied problem in computer science, language theory and algorithm design with a lot of variants arising out of different practical scenarios. String and sequence algorithms related to palindromes have long drawn the attention of stringology researchers [3, 11, 17]. Interestingly, in the seminal Knuth-Morris-Pratt paper presenting the well-known string matching algorithm [16], a problem related to palindrome recognition was also considered. In combinatorics on words, for example, studies have investigated the inhabitation of palindromes in Fibonacci words or Sturmian words in general [6, 7]. There is also an interesting conjecture related to periodicity of infinite strings whose every factor can be decomposed into a bounded number of palindromes [9].

In computational biology, palindromes play an important role in regulation of gene activity and other cell processes because these are often observed near promoters, introns and specific untranslated regions. Hairpins (also called complemented palindromes) in the HIV virus are strings of the form \(x \bar{x}^R\), where \(\bar{x}^R\) is the reverse complement of x, while (even) palindromes are strings of the form \(x x^R\). Algorithms for detecting palindromes can often be adapted to compute hairpins as well. Hence, we can identify palindromes in the DNA sequence and then align the part of the DNA sequence that contains them with the HIV virus.

In the beginnings of algorithmic study of palindromes, Manacher discovered an on-line sequential algorithm that finds all initial palindromes in a string [19]. A string \(S[1 ..n]\) is said to have an initial palindrome of length k if \(S[1 ..k]\) is a palindrome. Later Apostolico et al. observed that the algorithm given by [19] is able to find all maximal palindromic factors in the string in \(\mathcal {O}(n)\) time [2]. Gusfield gave another linear-time algorithm to find all maximal palindromes in a string [14]. He also discussed the relation between biological sequences and gapped (separated) palindromes (i.e. strings of the form \(x v\bar{x}^R\)). Gupta et al. [13] presented an \(\mathcal {O}(n)\)-time algorithm to compute specific classes—based on length constraints—of such palindromes. Algorithms for finding the so-called gapped palindromes were also considered in [10, 17]. (In our study, we consider gaps between palindromes, not inside them.)

A problem that gained significant attention recently was decomposing a string into a minimal number of palindromes; any such decomposition is called a palindromic factorization. Fici et al. [8] presented an on-line \(\mathcal {O}(n \log n)\)-time algorithm for computing a palindromic factorization of a string of length n. A similar on-line algorithm was presented by I et al. [15] as well as an on-line algorithm with the same time complexity to factorize a string into maximal palindromes. Alatabbi et al. gave an off-line \(\mathcal {O}(n)\)-time algorithm for the latter problem [1]. In addition, Rubinchik and Shur [20] devised an \(\mathcal {O}(n)\)-sized data structure that helps locate palindromes in a string; they also show how it can be used to compute the palindromic factorization of a string in \(\mathcal {O}(n \log n)\) time.

A similar problem, first studied by Galil and Seiferas in [12], asked whether a given string can be decomposed into k palindromes. Galil and Seiferas [12] presented an on-line \(\mathcal {O}(n)\)-time algorithm for \(k=1,2\) and an off-line \(\mathcal {O}(n)\)-time algorithm for \(k=3,4\). In 2014, Kosolobov et al. presented an on-line \(\mathcal {O}(kn)\)-time algorithm to decide this for arbitrary k [18].

Our work is a continuation of this line of research, motivated by possible errors and inconsistencies in the biological data. We extend the previous work by introducing a constraint on the length of the palindromes and allowing gaps and errors in the decompositions. By gaps we mean regions of the string that are not decomposed into palindromes of sufficient length. We allow errors in the palindromes, so that a palindrome with errors is a string having a small Hamming or edit distance from an ideal palindrome. We present two approaches for decomposing a string into sufficiently long palindromes; one allowing only gaps in the decomposition and the other allowing both gaps in the decomposition and errors in the palindromes. We first present an algorithm that computes a palindromic decomposition with the minimal total gap length of a string of length n in time \(\mathcal {O}(n \log {n} \cdot g)\) and space \(\mathcal {O}(n \cdot g)\), where g is the number of allowed gaps. Secondly, we present an \(\mathcal {O}(n \cdot (g + \delta ))\)-time and \(\mathcal {O}(n \cdot g)\)-space algorithm for the decomposition of a string of length n into maximal palindromes with at most \(\delta \) errors each, under the Hamming or edit distance, and g allowed gaps. The algorithms can be applied for both standard and complemented palindromes.

2 Notation and Terminology

Let \(S=S[1] S[2]\cdots S[n]\) be a string of length \(|S|=n\) over an alphabet \(\varSigma \). We consider the case of an integer alphabet; in this case each letter can be replaced by its rank so that the resulting string consists of integers in the range \(\{1,\ldots ,n\}\). For two positions i and j, where \(1 \le i \le j \le n\), in S, we denote the factor \(S[i]S[i+1]\cdots S[j]\) of S by \(S[i ..j]\). We denote the reverse string of S by \(S^R\), i.e. \(S^R=S[n]S[n-1] \cdots S[1]\). The empty string (denoted by \(\varepsilon \)) is the unique string over \(\varSigma \) of length 0. A string S is said to be a palindrome if and only if \(S=S^R\). If \(S[i ..j]\) is a palindrome, the number \(\frac{i+j}{2}\) is called the center of \(S[i ..j]\). Let \(S[i ..j]\), where \(1 \le i \le j \le n\), be a palindromic factor in S. It is said to be a maximal palindrome if there is no longer palindrome in S with center \(\frac{i+j}{2}\). Note that a maximal palindrome can be a factor of another palindrome.

Definition 1

We say that \(S=p_1 p_2 \cdots p_{\ell }\) is a (maximal) palindromic decomposition of S if all the strings \(p_i\) are (maximal) palindromes.

Definition 2

A (maximal) palindromic decomposition of S such that the number of (maximal) palindromes is minimal is called a (maximal) palindromic factorization of S.

Note that any single letter is a palindrome and, hence, every string can always be decomposed into palindromes. However, not every string can be decomposed into maximal palindromes; e.g. consider \(S=\texttt {abaca}\) [1].

Let f be an involution on the alphabet \(\varSigma \), i.e., a function such that \(f^2=\mathrm {id}\). We extend f into a morphism on strings over \(\varSigma \). We say that a string x is a generalized palindrome if \(x=f(x^R)\). Two known notions fit this definition:

  • If \(f=\mathrm {id}\), then a generalized palindrome is a standard palindrome.

  • If \(\varSigma =\{\mathtt {A},\mathtt {C},\mathtt {G},\mathtt {T}\}\) and \(f(\mathtt {A})=\mathtt {T}\), \(f(\mathtt {C})=\mathtt {G}\), \(f(\mathtt {G})=\mathtt {C}\), \(f(\mathtt {T})=\mathtt {A}\), then a generalized palindrome corresponds to a so-called complemented palindrome [14].

Example 3

The string \(\mathtt {A~G~T~A~C~T~T~C~A~T~G~A}\) is a standard palindrome and the string \(\mathtt {T~A~G~T~C~G~A~C~T~A}\) is a complemented palindrome.

We also consider (generalized) palindromes with errors. Let us recall two well-known metrics on strings. Let u and v be two strings. If \(|u|=|v|\), then the Hamming distance between u and v is the number of positions where u and v do not match. The edit (or Levenshtein) distance between u and v is the minimum number of edit operations (insertions, deletions, substitutions) needed to transform u into v. We say that x is a generalized \(\delta \) -palindrome under the Hamming distance (or the edit distance) if the minimum Hamming distance (edit distance, respectively) from x to any generalized palindrome is at most \(\delta \).

A generalized palindrome \(S[i ..j]\) is called maximal if there is no longer generalized palindrome with the same center. Similarly, a generalized \(\delta \)-palindrome \(S[i ..j]\) under the Hamming/edit distance is called maximal if there is no longer generalized \(\delta \)-palindrome under the same distance measure with the same center.

Example 4

All maximal 0-palindromes/1-palindromes in \(\mathtt {GTATCG}\) (for \(f = \mathrm {id}\)) under the Hamming and under the edit distance are as follows:

Center

1

1.5

2

2.5

3

3.5

4

4.5

5

5.5

6

0

\(\mathtt {G}\)

\(\mathtt {\varepsilon }\)

\(\mathtt {T}\)

\(\mathtt {\varepsilon }\)

\(\mathtt {TAT}\)

\(\mathtt {\varepsilon }\)

\(\mathtt {T}\)

\(\mathtt {\varepsilon }\)

\(\mathtt {C}\)

\(\mathtt {\varepsilon }\)

\(\mathtt {G}\)

1 under Hamming

\(\mathtt {G}\)

\(\mathtt {GT}\)

\(\mathtt {GTA}\)

\(\mathtt {TA}\)

\(\mathtt {GTATC}\)

\(\mathtt {AT}\)

\(\mathtt {ATC}\)

\(\mathtt {TC}\)

\(\mathtt {TCG}\)

\(\mathtt {CG}\)

\(\mathtt {G}\)

1 under edit

\(\mathtt {G}\)

\(\mathtt {GT}\)

\(\mathtt {GTA}\)

\(\mathtt {GTAT}\)

\(\mathtt {GTATC}\)

\(\mathtt {GTATCG}\)

\(\mathtt {ATC}\)

\(\mathtt {TC}\)

\(\mathtt {TCG}\)

\(\mathtt {CG}\)

\(\mathtt {G}\)

For instance, the whole string GTATCG is a 1-palindrome under the edit distance, as deleting its fifth letter yields a palindrome GTATG.

The computational problems we study can be formally stated as follows.

figure a
figure b

We apply several instances of dynamic programming. For simplicity of presentation, we only show how to compute the minimal total length of gaps and omit describing the retrieval of the decomposition itself. To compute the latter, in each of the dynamic programming matrices we would store a pointer to the cell that gave us the minimum value so that we could actually compute the decomposition with the minimal total length of the gaps by backtracing.

3 Palindromic Decomposition with Gaps

In this section we develop an efficient solution to the Generalized Palindromic Decomposition with Gaps problem. It is based on several transformations of the algorithm for computing a palindromic factorization by Fici et al. [8]. For a string S of length n this algorithm works in \(\mathcal {O}(n \log n)\) time. The algorithm consists of two steps:

  1. 1.

    Let \(P_j\) be the sorted list of starting positions of all palindromes ending at position j in S. This list may have size \(\mathcal {O}(j)\). However, it follows from combinatorial properties of palindromes that the sequence of consecutive differences in \(P_j\) is non-increasing and contains at most \(\mathcal {O}(\log j)\) distinct values. Let \(P_{j,\varDelta }\) be the maximal sublist of \(P_j\) containing elements whose predecessor in \(P_j\) is smaller by exactly \(\varDelta \). Then there are \(\mathcal {O}(\log j)\) such sublists in \(P_j\). Hence, \(P_j\) can be represented by a set \(G_j\) of size \(\mathcal {O}(\log j)\) which consists of triples of the form \((i,\varDelta ,k)\) that represent \(P_{j,\varDelta } = \{i,i+\varDelta ,\ldots ,i+(k-1)\varDelta \}\). The triples are sorted according to decreasing values of \(\varDelta \) and all starting positions in each triple are greater than in the previous one. Fici et al. show that \(G_j\) can be computed from \(G_{j-1}\) in \(\mathcal {O}(\log j)\) time.

  2. 2.

    Let PL[j] denote the number of palindromes in a palindromic factorization of \(S[1 ..j]\). Fici et al. show that it can be computed via a dynamic programming approach, using all palindromes from \(G_j\) in \(\mathcal {O}(\log j)\) time. Their algorithm works as follows. Let \(PL_\varDelta [j]\) be the minimum number of palindromes we can decompose \(S[1 ..j]\) in, provided that we use a palindrome from \((i,\varDelta ,k) \in G_j\). Then \(PL_\varDelta [j]\) can be computed in constant time using \(PL_\varDelta [j-\varDelta ]\) based on the fact that if \((i,\varDelta ,k) \in G_j\) and \(k \ge 2\), then \((i,\varDelta ,k-1) \in G_{j-\varDelta }\). Exploiting this fact, \(PL_\varDelta [j]\) can be computed by only considering \(PL_\varDelta [j-\varDelta ]\) and the shortest palindrome in \((i,\varDelta ,k)\). Finally, we compute PL[j] from all such \(PL_\varDelta [j]\) values.

In Appendix A we show for completeness that the same approach works for generalized palindromes for any involution f.

To solve the Generalized Palindromic Decomposition with Gaps problem, we first need to modify each of the triples in \(G_j\) to reflect the length constraint (m). More precisely, due to the length constraint, in each \(G_j\) some triples will disappear completely, and at most one triple will get trimmed (i.e. the parameter k will be decreased).

Our algorithm then computes an array \(MG[1 ..n][0 ..g]\) such that MG[j][q] is the minimum possible total length of gaps in a palindromic decomposition of \(S[1 ..j]\), provided that there are at most q gaps. Simultaneously, our algorithm computes an auxiliary array \(MG'[1 ..n][0 ..g]\) such that \(MG'[j][q]\) is the minimum possible total length of gaps up to position j provided that this position belongs to a gap: at most the q-th one.

For \(j>0\) and \(q \ge 0\) we have the following formula:

$$MG[j][q] = \min (MG'[j][q],\min _\varDelta \{MG_\varDelta [j][q]\})$$

where \(MG_\varDelta [j][q]\) is the partial minimum computed only using generalized palindromes from \((i,\varDelta ,k) \in G_j\). The formula means: either we have a gap at position j, or we use a generalized palindrome ending at position j. We also set \(MG[0][q]=0\) for any \(q \ge 0\).

We compute \(MG_\varDelta [j][q]\) for \((i,\varDelta ,k) \in G_j\) using the same approach as Fici et al. [8] used for \(PL_\varDelta \), ignoring the triples that disappear due to the length constraint. If there is a triple that got trimmed, then the corresponding triple at position \(j-\varDelta \) (from which we reuse the values in the dynamic programming) must have got trimmed as well. More precisely, if the triple \((i, \varDelta , k)\) is trimmed to \((i, \varDelta , k')\) at position j, then at position \(j-\varDelta \) there is a triple \((i, \varDelta , k-1)\) which is trimmed to \((i, \varDelta , k'-1)\); that is, by the same number of generalized palindromes. Consequently, to compute \(MG_\varDelta [j][q]\) from \(MG_\varDelta [j-\varDelta ][q]\), we need to include one additional generalized palindrome (the shortest one in the triple) just as in Fici et al.’s approach.

Example 5

Consider the string AACCAACCAACCAACCAA, \(f=\mathrm {id}\), and let \(m=7\).

figure c

Then \(G_{18}=\{(1, \infty , 1), (5,4,4), (18,1,1) \}\), where:

  • \((1, \infty , 1)\) represents the whole string,

  • (5, 4, 4) represents \(\{ \texttt {AACCAACCAACCAA}, \texttt {AACCAACCAA}, \texttt {AACCAA}, \texttt {AA} \}\) which will get trimmed by 2 palindromes due to the length constraint, becoming (5, 4, 2),

  • (18, 1, 1) represents \(\{\texttt {A}\}\) and disappears.

Now looking at position \(j-\varDelta =18-4=14\) for the trimmed group, we had \((5,4,3) \in G_{14}\) representing \(\{ \texttt {AACCAACCAA}, \texttt {AACCAA}, \texttt {AA} \}\), and this also gets trimmed by 2 palindromes, becoming (5, 4, 1).

Finally, for \(j>0\) and \(q>0\) we compute \(MG'\) using the following formula:

$$MG'[j][q] = \min (MG'[j-1][q],MG[j-1][q-1])+1.$$

The first case corresponds to continuing the gap from position j, whereas the second to using a generalized palindrome finishing at position \(j-1\) or a gap finishing at position \(j-1\) (the latter will be suboptimal). Here the border cases are \(MG'[j][0] = \infty \) for \(j \ge 0\) and \(MG'[0][q] = \infty \) for \(q>0\).

Thus we arrive at the complete solution to the problem.

Theorem 6

The Generalized Palindromic Decomposition with Gaps problem can be solved in \(\mathcal {O}(n \log n \cdot g)\) time and \(\mathcal {O}(n\cdot g)\) space.

4 Computing Maximal Palindromes with Errors

Recall that all maximal (standard) palindromes in a string can be computed in \(\mathcal {O}(n)\) time by Manacher’s [5, 19] and Gusfield’s [14] algorithms. These algorithms perform different computations for odd- and for even-length palindromes. Recall that we defined the centers of odd-length palindromes as integers and the centers of even-length palindromes as odd multiples of \(\frac{1}{2}\).

Gusfield’s algorithm [14] applies Longest Common Extension (LCE) Queries in the string \(T=S \$ S^R\), where \(\$ \not \in \varSigma \) is a sentinel character. An LCE(ij) query returns the length of the longest common prefix of the suffixes \(T[i ..|T|]\) and \(T[j ..|T|]\). For example, to compute the length of the maximal even-length palindrome centered between positions i and \(i+1\), the algorithm computes \(LCE(i+1,2n+2-i)\) in T. Recall that LCE queries in a string (over an integer alphabet) can be answered in \(\mathcal {O}(1)\) time after linear-time preprocessing [4].

Gusfield’s approach can be easily adapted to generalized palindromes: it suffices to apply LCE-queries on \(T=S\$f(S^R)\). To further simplify the description of this approach, we introduce the Longest Gapped Palindrome (LGPal) Queries, such that LGPal(ij) is the maximum k such that \(f(S[i-k+1 ..i]^R) = S[j ..j+k-1]\); see Fig. 1. As we have already noticed, LGPal-queries are equivalent to LCE-queries in \(T=S\$f(S^R)\).

Fig. 1.
figure 1

To find the longest complemented 1-palindrome under the Hamming distance centered at position 7.5 in \(S=\mathtt {GACATTCGAACGT}\), it suffices to ask two LGPal-queries: \(LGPal(7,8)=3\) finds the first mismatch, and LGPal(3, 12) extends the 1-palindrome after the mismatch. Note that each of these LGPal-queries is equivalent to an appropriate LCP-query in \(S\$f(S^R)\).

It is known (see [14]) that all maximal generalized \(\delta \)-palindromes under the Hamming distance can be computed in \(O(n \cdot \delta )\) time via at most \(\delta \) applications of the LGPal-query for each possible center position. Below we show how to compute maximal generalized \(\delta \)-palindromes under the edit distance within the same time complexity.

Recall that if u is a generalized \(\delta \)-palindrome under the edit distance, then there exists a generalized palindrome v such that the minimal number of edit operations (insertion, deletion, substitution) required to transform u to v is at most \(\delta \). The following simple observation shows that we can restrict ourselves to deletions and substitutions only, which we call in what follows the restricted edit operations. Intuitively, instead of inserting at position i a character to match the character at position \(|u|-i+1\), we can delete the character at position \(|u|-i+1\).

Observation 7

Let u be a generalized \(\delta \)-palindrome and v a generalized palindrome such that the edit distance between u and v is minimal. Then there exists a generalized palindrome \(v'\) such that the number of restricted edit operations needed to transform u to \(v'\) is equal to the edit distance between u and v.

We can extend a maximal generalized \(\delta \)-palindrome \(S[i ..j]\) to a maximal generalized \((\delta +1)\)-palindrome in three ways; either ignore the letter \(S[i-1]\) and then perform an LGPal-query, or ignore the letter \(S[j+1]\) and then perform an LGPal-query, or ignore both and then perform the LGPal-query. More formally:

Definition 8

Assume that \(S[i ..j]\) is a generalized \(\delta \)-palindrome. Then we say that each of the factors \(S[i' ..j']\) for:

  • \(i'=i-1-d\), \(j'=j+d\), where \(d=LGPal(i-2,j+1)\)

  • \(i'=i-d\), \(j'=j+1+d\), where \(d=LGPal(i-1,j+2)\)

  • \(i'=i-1-d\), \(j'=j+1+d\), where \(d=LGPal(i-2,j+2)\)

is an extension of \(S[i ..j]\). If the index \(i'\) is smaller than 1 or the index \(j'\) is greater than |S|, the corresponding extension is not possible. We also say that \(S[i ..j]\) can be extended to any of the three strings \(S[i' ..j']\).

Clearly, the extensions of a generalized \(\delta \)-palindrome are always generalized \((\delta +1)\)-palindromes.

To facilitate the case of \(\delta \)-palindromes being prefixes or suffixes of the text, we also introduce the following border-reductions for \(S[i ..j]\) being a generalized \(\delta \)-palindrome:

  • If \(i=1\), a border reduction leads to \(S[1 ..j-1]\).

  • If \(j=n\), a border reduction leads to \(S[i+1 ..n]\).

If any of the reductions is possible, we also say that \(S[i ..j]\) can be border-reduced to the corresponding strings. As previously, border-reductions of a generalized \(\delta \)-palindrome are always generalized \((\delta +1)\)-palindromes.

Lemma 9

Given a maximal generalized \(\delta \)-palindrome \(S[i' ..j']\) with \(\delta >0\), there exists a maximal generalized \((\delta -1)\)-palindrome \(S[i ..j]\) which can be extended or border-reduced to \(S[i' ..j']\).

Proof

Consider a shortest sequence of restricted edit operations that transforms \(u=S[i' ..j']\) into a generalized palindrome v. Let us consider the position where we perform a restricted edit operation that is closest to \(i'\) or \(j'\). Assume w.l.o.g. that this position—denote it by e—is not further to \(i'\) than to \(j'\).

Assume first that this edit operation is a substitution. Then \(S[i ..j]\), for \(i=e+1\) and \(j=j'-(e+1-i')\), is a generalized \((\delta -1)\)-palindrome (the witness generalized palindrome is the corresponding factor of v); see Fig. 2. Moreover, it is a maximal generalized \((\delta -1)\)-palindrome, as otherwise \(S[e]=S[i-1]\) would be equal to \(f(S[j+1])\), which means that the substitution at the position e would not be necessary. This completes the proof in this case.

Fig. 2.
figure 2

If the outermost restricted edit operation on \(S[i' ..j']\) is a substitution (from letter X to letter Y), then \(S[i' ..j']\) is an extension of the third type of the maximal generalized \((\delta -1)\)-palindrome \(S[i ..j]\).

Fig. 3.
figure 3

Three cases resulting when the outermost edit operation on \(S[i' ..j']\) is a deletion of a character X.

Now assume that the edit operation at the position e was a deletion. Let \(a=e+1\) and \(b=j'-(e-i')\). Again, we see that clearly \(S[a ..b]\) is a generalized \((\delta -1)\)-palindrome. If it is maximal, then we are done. Otherwise, consider the maximal generalized \((\delta -1)\)-palindrome \(S[i ..j]\) centered at the same position as \(S[a ..b]\) (\(a-i = j-b >0\)). Now we have three cases; see Fig. 3.

  1. 1.

    If \(j \le j'\), then we can obtain \(S[i' ..j']\) by an extension (of the first type) of \(S[i ..j]\); i.e. ignoring the letter \(S[i-1]\).

  2. 2.

    If \(j > j'\), then we have that \(S[i' ..j'+1]\) is a generalized \((\delta -1)\)-palindrome. If, additionally, \(i' > 1\), then \(S[i'-1 ..j'+1]\) is a generalized \(\delta \)-palindrome, which contradicts the maximality of \(S[i' ..j']\).

  3. 3.

    Finally, if \(j > j'\) and \(i'=1\), then \(i=1\), \(j=j'+1\). Hence, \(S[i' ..j']\) obtained from \(S[i ..j]\) by a border-reduction.

This completes the proof of the lemma.    \(\square \)

The combinatorial characterization of Lemma 9 yields an algorithm for generating all maximal generalized d-palindromes, for all centers and subsequent \(d=0,\ldots ,\delta \). Maximal generalized 0-palindromes are computed using Gusfield’s approach (LGPal-queries). For a given \(d < \delta \), we consider all the maximal generalized d-palindromes and try to extend each of them in all three possible ways (and border-reduce, if possible). This way we obtain a number of generalized \((d+1)\)-palindromes amongst which, by Lemma 9, are all maximal generalized \((d+1)\)-palindromes. To exclude the non-maximal ones, we group the generalized \((d+1)\)-palindromes by their centers (in \(\mathcal {O}(n)\) time via bucket sort) and retain only the longest one for each center. We arrive at the following intermediate result.

Lemma 10

Under the edit distance, all maximal generalized \(\delta \)-palindromes in a string of length n can be computed in \(\mathcal {O}(n \cdot \delta )\) time and \(\mathcal {O}(n)\) space.

5 Maximal Palindromic Decomposition with Gaps and Errors

Let \(\mathcal {F}\) be a set of factors of the text \(S[1 ..n]\). In this section we develop a general framework that allows to decompose S into factors from \(\mathcal {F}\), allowing at most g gaps. We call such a factorization a \((g,\mathcal {F})\)-factorization of S. Our goal is to find a \((g,\mathcal {F})\)-factorization of S that minimizes the total length of gaps. We aim at the time complexity \(\mathcal {O}((n+|\mathcal {F}|) \cdot g)\) and space complexity \(\mathcal {O}(n\cdot g+|\mathcal {F}|)\).

In our solution we use dynamic programming to compute two arrays, similar to the ones used in Sect. 3:

  • \(MG[1 ..n][0 ..g]\): MG[j][q] is the minimum total length of gaps in a \((q,\mathcal {F})\)-factorization of \(S[1 ..j]\).

  • \(MG'[1 ..n][0 ..g]\): \(MG'[j][q]\) is the minimum total length of gaps in a \((q,\mathcal {F})\)-factorization of \(S[1 ..j]\) for which the position j belongs to a gap.

We use the following formulas, for \(j>0\) and \(q>0\):

$$\begin{aligned} MG[j][q]&= \min (MG'[j][q],\min _{S[a ..j] \in \mathcal {F}} MG[a-1][q]) \\ MG'[j][q]&= \min (MG[j-1][q-1],MG'[j-1][q])+1 \end{aligned}$$

The border cases are exactly the same as in Sect. 3.

Clearly, the space complexity of this solution is \(\mathcal {O}(n\cdot g+|\mathcal {F}|)\). Let us analyse its time complexity. Fix \(q \in \{0,\ldots ,g\}\). The number of transitions using the factors from \(\mathcal {F}\) in the dynamic programming is \(|\mathcal {F}|\) in total, as each factor is used only for the position j where it ends. Hence, the formulas for MG[j][q] take \(\mathcal {O}(n\cdot g+|\mathcal {F}|\cdot g)\) time to evaluate. Computing the \(MG'[j][q]\) values takes \(\mathcal {O}(n\cdot g)\) time. Thus we arrive at the desired time complexity of \(\mathcal {O}((n+|\mathcal {F}|)\cdot g)\).

We apply this approach to maximal generalized \(\delta \)-palindromes in each of the considered metrics (see the classic result from [14] for the Hamming distance and Lemma 10 for the edit distance) to obtain the following result.

Theorem 11

The Generalized Maximal \(\delta \)-Palindromic Decomposition with Gaps problem under the Hamming distance or the edit distance can be solved in \(\mathcal {O}(n\cdot (g+ \delta ))\) time and \(\mathcal {O}(n\cdot g)\) space.

Example 12

Consider the following stringFootnote 1 of length 92:

GGACTCGGCTTGCTGAGGTGCACACAGCAAGAGGCGAGAGCGGCGACTGGTGAGTACGCCAAATTT

TGACTAGCGGAGGCTAGAAGGAGAGA

We have used our implementation of the algorithm from Theorem 11 to compute the decomposition of the string into maximal complemented 3-palindromes of length at least 14 under the edit distance with at most 4 gaps (\(g=4\), \(\delta =3\), \(m=14\)) with the minimal total gap length:

figure d

The gaps are given in square brackets. Edit operations are underlined, with deletes additionally given in italics. The gaps have total length 32.

In comparison, the optimal decomposition of this string under the Hamming distance with the same parameters (\(g=4\), \(\delta =3\), \(m=14\)) uses four gaps of total length 46.

6 Conclusions

We have presented two algorithms for finding palindromic decompositions: one allowing gaps and the other allowing both gaps in the decomposition and errors in palindromes. The first algorithm shows that (somewhat surprisingly) Fici et al.’s algorithm [8] for finding an exact palindromic factorization can be extended to handle gaps, a constraint on the palindromes length, and complements in palindromes as well. In the second algorithm we decompose a string into maximal palindromes with errors; the most involved part here was computing all such maximal palindromes under the edit distance.

In the problems that were defined in the beginning, the objective was to minimize the total length of gaps, allowing a certain number of gaps. However, the approaches that were presented in this paper can be used to solve different variants of the problems, like minimizing only the total number of gaps or maximizing the total length of palindromes, regardless of the number of gaps.

An open question is to efficiently compute decompositions into palindromes that may contain errors and are not necessarily maximal. This problem seems to be hard, as \(\delta \)-palindromes do not have such a strong combinatorial structure as palindromes without errors.