1 Introduction

In the Pattern Matching problem with Swaps (Swap Matching, for short), the goal is to find all occurrences of any swap version of a pattern P in a text T, where P and T are strings of length p and t, respectively, over an alphabet \(\varSigma \). By a swap version of a pattern P we mean a string of symbols created from P by swapping adjacent symbols while ensuring that each symbol is swapped at most once (see Sect. 2 for formal definitions). The solution of Swap Matching is the set of indices at which occurrences of P in T begin. Swap Matching is an intensively studied problem due to its use in practical applications such as text and music retrieval, data mining, network security and biological computing [5].

The swap of two consecutive symbols is one of the most typical typing errors. It also represents a simpler version of the swaps that appear in nature. In particular, the phenomenon of swaps occurs in gene mutations and duplications, such as in the region of human chromosome 5 that is implicated in the disease called spinal muscular atrophy, a common recessive form of muscular dystrophy [15]. While the biological swaps occur at a gene level and have several additional constraints and characteristics, which make the problem much more difficult, they do serve as a convincing pointer to the theoretical study of swaps as a natural edit operation for the approximation metric [2]. Indeed, Wagner and Lowrance [17] suggested adding the swap operation when considering the edit distance of two strings.

Swap Matching was introduced in 1995 as an open problem in non-standard string matching [16]. The first result was reported by Amir et al. [2] in 1997, who provided an \(O(t p^{\frac{1}{3}}\log p)\)-time solution for alphabets of size 2, while also showing that alphabets of size exceeding 2 can be reduced to size 2 with a little overhead. Amir et al. [4] also came up with a solution of \(O(t \log ^2 p)\) time complexity for some very restrictive cases. Several years later Amir et al. [3] showed that Swap Matching can be solved by an algorithm for overlap matching, achieving a running time of \(O(t\log p \log |\varSigma |)\). This algorithm, as well as all the previous ones, is based on the fast Fourier transform (FFT).

In 2008 Iliopoulos and Rahman [14] introduced a new graph theoretic approach to model the Swap Matching problem and came up with the first efficient solution to Swap Matching that does not use FFT (which we show to be incorrect). Their algorithm, based on bit parallelism, runs in \(O((t+p)\log p)\) time if the pattern length is similar to the word-size of the target machine. One year later Cantone and Faro [8] presented the Cross Sampling algorithm, solving Swap Matching in \(O(t)\) time and \(O(|\varSigma |)\) space, assuming that the pattern length is similar to the word-size of the target machine. In the same year Campanelli et al. [7] enhanced the Cross Sampling algorithm using notions from the Backward directed acyclic word graph matching algorithm and named the new algorithm Backward Cross Sampling. This algorithm also assumes a short pattern. Although Backward Cross Sampling has \(O(|\varSigma |)\) space and \(O(tp)\) time complexity, which is worse than that of Cross Sampling, it improves the real-world performance.

In 2013 Faro [11] presented a new model to solve Swap Matching using reactive automata and also presented a new algorithm with \(O(t)\) time complexity assuming short patterns. The same year Chedid [10] improved the dynamic programming solution by Cantone and Faro [8], which resulted in more intuitive algorithms. In 2014 a minor improvement by Fredriksson and Giaquinta [12] appeared, yielding slightly (by at most a factor of \(|\varSigma |\)) better asymptotic time complexity (and also slightly worse space complexity) for special cases of patterns. The same year Ahmed et al. [1] took the ideas of the algorithm by Iliopoulos and Rahman [14] and devised two algorithms named Smalgo-I and Smalgo-II, which both run in \(O(t)\) time for short patterns, but bear the same error as the original algorithm.

Our Contribution. We design a simple algorithm which solves the Swap Matching problem. The goal is to design a streaming algorithm, which receives one symbol per execution step until the end-of-input arrives, and thus does not need access to the whole input. This algorithm has \(O({\lceil \frac{p}{w}\rceil }(|\varSigma |+t) + p)\) time and \(O({\lceil \frac{p}{w}\rceil }|\varSigma |)\) space complexity, where w is the word-size of the machine. We would like to stress that our solution, being based on the graph theoretic approach, does not use FFT. Therefore, it yields a much simpler non-recursive algorithm allowing bit parallelism, and it does not suffer from the disadvantages of the convolution-based methods. While our algorithm matches the best asymptotic complexity bounds of the previous results [8, 12] (up to a \(|\varSigma |\) factor), we believe that its strength lies in the applications where the alphabet is small and the pattern length is at most the word-size, as it can be implemented using only \(7+|\varSigma |\) CPU registers and few machine instructions. This makes it practical for tasks like DNA sequence scanning. Also, as far as we know, our algorithm is currently the only known streaming algorithm for the Swap Matching problem.

We continue by proving that any deterministic finite automaton that solves Swap Matching has a number of states exponential in the length of the pattern.

We also describe the Smalgo (swap matching algorithm) by Iliopoulos and Rahman [14] in detail. Unfortunately, we have discovered that Smalgo and the derived algorithms contain a flaw which causes false positives to appear. We have prepared implementations of Smalgo-I, Cross Sampling, Backward Cross Sampling and our own algorithm, and measured the running times and the rate of false positives of the Smalgo-I algorithm. All of the sources are available for download (Footnote 1).

This paper is organized as follows. First we introduce all the basic definitions and recall the graph theoretic model introduced in [14] and its use for matching in Sect. 2. In Sect. 3 we present our algorithm for the Swap Matching problem and follow in Sect. 4 with the proof that Swap Matching cannot be solved efficiently by deterministic finite automata. Then we describe Smalgo in detail in Sect. 5 and finish with the experimental evaluation of the algorithms in Sect. 6.

2 Basic Definitions and the Graph Theoretic Model

In this section we state the basic definitions, present the graph theoretic model and show a basic algorithm that solves Swap Matching using the model.

2.1 Notations and Basic Definitions

We use the word-RAM as our computational model. That means we have access to memory cells of fixed capacity w (e.g., 64 bits). A standard set of arithmetic and bitwise instructions includes And (\( \mathbin { \& }\)), Or (\(\mid \)), Left bitwise-shift (\(\text{ LShift }\)) and Right bitwise-shift (\(\text{ RShift }\)). Each of the standard operations on words takes a single unit of time. In order to compare to other existing algorithms, which are not streaming, we define the access to the input in a less restrictive way: the input is read from a read-only part of memory and the output is written to a write-only part of memory. However, it will be easy to observe that our algorithm accesses the input sequentially. We do not include the input and the output in the space complexity analysis.

A string S over an alphabet \(\varSigma \) is a finite sequence of symbols from \(\varSigma \) and |S| is its length. By \(S_{i}\) we mean the i-th symbol of S and we define a substring \({S}_{[i,j]}=S_{i} S_{i+1} \dots S_{j}\) for \(1 \le i \le j \le |S|\), and a prefix \({S}_{[1,i]}\) for \(1 \le i \le |S|\). We say that a string P prefix matches n symbols of a string T at position i if \({P}_{[1,n]}={T}_{[i,i+n-1]}\).

Next we formally introduce a swapped version of a string.

Definition 1

(Campanelli et al. [7]). A swap permutation for a string S of length n is a permutation \(\pi : \{ 1, \dots , n \} \rightarrow \{ 1, \dots , n \}\) such that:

  (i) if \(\pi (i) = j\) then \(\pi (j) = i\) (symbols at positions i and j are swapped),

  (ii) for all \(i\), \(\pi (i)\in \{i-1,i,i+1\}\) (only adjacent symbols are swapped),

  (iii) if \(\pi (i) \ne i\) then \(S_{\pi (i)} \ne S_i\) (identical symbols are not swapped).

For a string S a swapped version \(\pi (S)\) is a string \(\pi (S) = S_{\pi (1)} S_{\pi (2)} \dots S_{\pi (n)}\) where \(\pi \) is a swap permutation for S.
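To make the definition concrete, here is a minimal C++ sketch (our own illustration, not part of any of the discussed algorithms) that enumerates all swapped versions of a string: a swap permutation is exactly a choice of disjoint adjacent transpositions of distinct symbols, made left to right.

```cpp
#include <string>
#include <vector>

// Enumerate all swapped versions of s (Definition 1): at each position either
// keep the symbol fixed, or transpose it with the next one if the two differ.
void swappedVersions(const std::string& s, std::size_t i, std::string cur,
                     std::vector<std::string>& out) {
    if (i == s.size()) { out.push_back(cur); return; }
    swappedVersions(s, i + 1, cur + s[i], out);                // pi(i) = i
    if (i + 1 < s.size() && s[i] != s[i + 1])                  // conditions (ii), (iii)
        swappedVersions(s, i + 2, cur + s[i + 1] + s[i], out); // swap positions i, i+1
}
```

For example, for \(S = abab\) the sketch yields abab, abba, aabb, baab and baba; we return to this list in Sect. 5.2.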

Now we formalize the version of matching we are interested in.

Definition 2

Given a text \(T = T_1 T_2 \dots T_t\) and a pattern \(P = P_1 P_2 \dots P_p\), the pattern P is said to swap match T at location i if there exists a swapped version \(\pi (P)\) of P that matches T at location i, i.e., \(\pi (P) = {T}_{[i,i+p-1]}\).
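A naive test of Definition 2 at a single position follows the same left-to-right case analysis. The sketch below (our own helper, memoized so that it runs in O(p) time) decides whether some swapped version of P matches T at a given 1-based position.

```cpp
#include <functional>
#include <string>
#include <vector>

// Does some swapped version of P match T at the given 1-based position pos?
bool swapMatchesAt(const std::string& P, const std::string& T, std::size_t pos) {
    const std::size_t p = P.size(), off = pos - 1;  // 0-based offset into T
    if (pos == 0 || off + p > T.size()) return false;
    std::vector<int> memo(p + 1, -1);               // -1 unknown, else 0/1
    std::function<bool(std::size_t)> rec = [&](std::size_t i) -> bool {
        if (i == p) return true;                    // whole pattern matched
        if (memo[i] != -1) return memo[i] != 0;
        bool ok = P[i] == T[off + i] && rec(i + 1); // P_i left unswapped
        if (!ok && i + 1 < p && P[i] != P[i + 1] && // P_i, P_{i+1} swapped
            P[i + 1] == T[off + i] && P[i] == T[off + i + 1])
            ok = rec(i + 2);
        memo[i] = ok ? 1 : 0;
        return ok;
    };
    return rec(0);
}
```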

2.2 A Graph Theoretic Model

The algorithms in this paper are based on a model introduced by Iliopoulos and Rahman [14]. In this section we briefly describe this model.

Fig. 1. \(P\)-graph \(\mathcal {P}_{P}\) for the pattern \(P = abcbbac\)

For a pattern P of length p we construct a labeled graph \(\mathcal {P}_{P}=(V, E, \sigma )\) with vertices V, edges E, and a vertex labeling function \(\sigma : V \rightarrow \varSigma \) (see Fig. 1 for an example). Let \(V=V'\setminus \{ m_{-1,1} , m_{1,p} \} \) where \(V' = \{m_{r,c}\mid r \in \{-1,0,1\}, c \in \{1,2,\dots ,p\}\}\). For \(m_{r,c} \in V\) we set \(\sigma (m_{r,c})=P_{r+c}\). Each vertex \(m_{r,c}\) is identified with an element of a \(3 \times p\) grid. We set \(E' := E'_1 \cup E'_2 \cup \dots \cup E'_{p-1}\), where \(E'_j := \{(m_{k,j},m_{i,j+1}) \mid k \in \{-1,0\},i \in \{0, 1\}\}\cup \{(m_{1,j},m_{-1,j+1})\}\), and let \(E = E' \cap (V \times V)\). We call \(\mathcal {P}_{P}\) the \(P\)-graph. Note that \(\mathcal {P}_{P}\) is a directed acyclic graph, \(|V(\mathcal {P}_{P})| = 3p-2\), and \(|E(\mathcal {P}_{P})| = 5(p-1)-4\).

The idea behind the construction of \(\mathcal {P}_{P}\) is as follows. We create vertices \(V'\) and edges \(E'\) which represent every swap pattern without unnecessary restrictions (equal symbols can be swapped). We remove vertices \(m_{-1,1}\) and \(m_{1,p}\) which represent symbols from invalid indices 0 and \(p+1\).

The \(P\)-graph now represents all possible swap permutations of the pattern P in the following sense. Vertices \(m_{0,j}\) represent ends of prefixes of swapped versions of the pattern which end with a non-swapped symbol. A possible swap of symbols \(P_j\) and \(P_{j+1}\) is represented by vertices \(m_{1,j}\) and \(m_{-1,j+1}\). Edges represent symbols which can be consecutive. Each path from column 1 to column p represents a swap pattern and each swap pattern is represented this way.
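The construction translates directly into code. The following C++ sketch (the data layout is ours: row \(r \in \{-1,0,1\}\) is stored at index \(r+1\), and the two removed vertices are simply marked non-existent) builds the labels and adjacency lists of \(\mathcal {P}_{P}\).

```cpp
#include <string>
#include <vector>

// P-graph of a pattern P (Sect. 2.2): vertex m_{r,c} has label P_{r+c} and
// edges go from column j to column j+1 exactly as in the definition of E'.
struct PGraph {
    int p;                               // pattern length
    std::vector<char> label;             // sigma(m_{r,c})
    std::vector<bool> exists;            // false for the two removed vertices
    std::vector<std::vector<int>> adj;   // outgoing edges
    int id(int r, int c) const { return (r + 1) * p + (c - 1); }
};

PGraph buildPGraph(const std::string& P) {
    PGraph g;
    g.p = (int)P.size();
    g.label.assign(3 * g.p, 0);
    g.exists.assign(3 * g.p, false);
    g.adj.assign(3 * g.p, {});
    for (int c = 1; c <= g.p; ++c)
        for (int r = -1; r <= 1; ++r) {
            if ((r == -1 && c == 1) || (r == 1 && c == g.p)) continue; // removed
            g.exists[g.id(r, c)] = true;
            g.label[g.id(r, c)] = P[c + r - 1];  // sigma(m_{r,c}) = P_{r+c}
        }
    auto addEdge = [&g](int r1, int c1, int r2, int c2) {
        if (g.exists[g.id(r1, c1)] && g.exists[g.id(r2, c2)])
            g.adj[g.id(r1, c1)].push_back(g.id(r2, c2));
    };
    for (int j = 1; j < g.p; ++j) {
        for (int k : {-1, 0})
            for (int i : {0, 1}) addEdge(k, j, i, j + 1);
        addEdge(1, j, -1, j + 1);        // a started swap must be completed
    }
    return g;
}
```

For \(p \ge 2\) the resulting graph has exactly \(3p-2\) vertices and \(5(p-1)-4\) edges, as noted above.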

Definition 3

For a given \(\varSigma \)-labeled directed acyclic graph \(G=(V,E,\sigma )\) vertices \(s,e \in V\) and a directed path \(f=v_1,v_2,\dots ,v_k\) from \(v_1=s\) to \(v_k=e\), we call \(S = \sigma (f) = \sigma (v_1) \sigma (v_2) \dots \sigma (v_k) \in \varSigma ^*\) a path string of f.

2.3 Using Graph Theoretic Model for Matching

In this section we describe an algorithm called Basic Matching Algorithm (BMA) which can determine whether there is a match of a pattern P in a text T at a position k, using any graph model which satisfies the following conditions:

  • it is a directed acyclic graph,

  • \(V = V_1 \uplus V_2 \uplus \dots \uplus V_p\) (the vertices can be divided into columns), and

  • \(E \subseteq \{ (u, w) \mid u \in V_i , w \in V_{i+1} , 1 \le i < p \}\) (edges lead only to the next column).

Let \(Q_0 = V_1\) be the starting vertices and \(F = V_p\) be the accepting vertices. BMA is designed to run on any graph which satisfies these conditions. Since the \(P\)-graph satisfies them, we can use BMA for \(\mathcal {P}_{P}\).

Algorithm 1. The Basic Matching Algorithm (BMA)

The algorithm runs as follows (see also Algorithm 1). We initialize the algorithm by setting \(D_{1}' := Q_0\) (Step 1). \(D_{1}'\) now holds information about vertices which are the end of some path f starting in \(Q_0\) for which \(\sigma (f)\) possibly prefix matches 1 symbol of \({T}_{[k,k+p-1]}\). To make sure that the path f represents a prefix match we need to check whether the label of the last vertex of the path f matches the symbol \(T_k\) (Step 3). If no prefix match is left we did not find a match (Step 4). If some prefix match is left we need to check whether we already have a complete match (Step 5). If the algorithm did not stop it means that we have some prefix match but it is not a complete match yet. Therefore we can try to extend this prefix match by one symbol (Step 6) and check whether it is a valid prefix match (Step 3). Since we extend the matched prefix in each step, we repeat these steps until the prefix match is as long as the pattern (Step 2).
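A direct C++ transcription of these steps, run on the PGraph sketch from Sect. 2.2 (our own code; D plays the role of \(D_i'\) and F of the filtered set):

```cpp
#include <string>
#include <vector>

// BMA on the P-graph: does P swap match T at the 1-based position k?
bool bmaMatch(const PGraph& g, const std::string& T, int k) {
    const int n = 3 * g.p;
    if (k < 1 || k - 1 + g.p > (int)T.size()) return false;
    std::vector<bool> D(n, false);
    for (int r : {0, 1})                        // Step 1: D'_1 := Q_0 = V_1
        if (g.exists[g.id(r, 1)]) D[g.id(r, 1)] = true;
    for (int i = 1; i <= g.p; ++i) {            // Step 2
        std::vector<bool> F(n, false);          // Step 3: filter by T_{k+i-1}
        bool any = false;
        for (int v = 0; v < n; ++v)
            if (D[v] && g.label[v] == T[k + i - 2]) { F[v] = true; any = true; }
        if (!any) return false;                 // Step 4: no prefix match left
        if (i == g.p) return true;              // Step 5: accepting column reached
        std::vector<bool> next(n, false);       // Step 6: propagate along edges
        for (int v = 0; v < n; ++v)
            if (F[v]) for (int u : g.adj[v]) next[u] = true;
        D = next;
    }
    return false;
}
```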

Keeping the vertices in explicit sets is not convenient for computation, so we present another way to describe this algorithm: we use the characteristic vectors of these sets instead.

Definition 4

A Boolean labeling function \(I^{} : V \rightarrow \{ 0,1 \}\) of vertices of \(\mathcal {P}_{P}\) is called a prefix match signal.

The algorithm can be easily divided into iterations according to the value of i in Step 2. We denote the value of the prefix match signal in the \(j\text {-th}\) iteration as \(I^{j}\) and we define the following operations:

  • propagate signal along the edges is an operation which sets \(I^{j}(v) := 1\) if and only if there exists an edge \((u,v) \in E\) with \(I^{j-1}(u) = 1\),

  • filter signal by a symbol \(x \in \varSigma \) is an operation which sets \(I^{}(v) := 0\) for each v with \(\sigma (v) \ne x\),

  • match check is an operation which checks whether there exists \(v \in F\) such that \(I(v) = 1\) and, if so, reports a match.

With these definitions in hand we can describe BMA in terms of prefix match signals as Algorithm 2. See Fig. 2 for an example of the use of BMA to figure out whether \(P = acbab\) swap matches \(T = babcabc\) at position 2.

Algorithm 2. BMA in terms of prefix match signals

Fig. 2. BMA of \({T}_{[2,6]} = abcab\) on the \(P\)-graph of the pattern \(P = acbab\). The prefix match signal propagates along the dashed edges. An index j above a vertex v indicates that \(I^{j}(v) = 1\); otherwise \(I^{j}(v) = 0\).

2.4 Shift-And Algorithm

The following description is based on [9, Chap. 5] describing the Shift-Or algorithm.

For a pattern P and a text T of length p and t, respectively, let \(R{}^{}_{}\) be a bit array of size p and \(R{}^{j}_{}\) its value after the text symbol \(T_j\) has been processed. It contains information about all matches of prefixes of P that end at position j in the text. For \(1 \le i \le p\), \(R{}^{j}_{i} = 1\) if \({P}_{[1,i]} = {T}_{[j-i+1,j]}\) and 0 otherwise. The vector \(R{}^{j+1}_{}\) can be computed from \(R{}^{j}_{}\) as follows. For each positive i we have \(R{}^{j+1}_{i+1} = 1\) if \(R{}^{j}_{i} = 1\) and \(P_{i+1} = T_{j+1}\), and \(R{}^{j+1}_{i+1} = 0\) otherwise. Furthermore, \(R{}^{j+1}_{1} = 1\) if \(P_{1} = T_{j+1}\) and 0 otherwise. If \(R{}^{j+1}_{p} = 1\) then a complete match can be reported.

The transition from \(R{}^{j}_{}\) to \(R{}^{j+1}_{}\) can be computed very fast as follows. For each \(x \in \varSigma \) let \(D^{x}_{}\) be a bit array of size p such that for \(1 \le i \le p, D^{x}_{i} = 1\) if and only if \(P_i = x\). The array \(D^{x}_{}\) denotes the positions of the symbol x in the pattern P. Each \(D^{x}_{}\) can be preprocessed before the search. The computation of \(R{}^{j+1}_{}\) is then reduced to three bitwise operations, namely \( R{}^{j+1}_{} = (\text{ LShift }(R{}^{j}_{}) \mid 1) \mathbin { \& }D^{T_{j+1}}_{}\). When \(R{}^{j}_{p} = 1\), the algorithm reports a match at position \(j - p + 1\).
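For concreteness, a minimal C++ rendering of the Shift-And algorithm just described (a sketch assuming \(p \le w = 64\); positions are reported 1-based):

```cpp
#include <cstdint>
#include <iostream>
#include <string>

void shiftAnd(const std::string& P, const std::string& T) {
    const std::size_t p = P.size();
    std::uint64_t D[256] = {};                       // D^x masks, bit i-1 <-> P_i
    for (std::size_t i = 0; i < p; ++i) D[(unsigned char)P[i]] |= 1ULL << i;
    std::uint64_t R = 0;
    for (std::size_t j = 0; j < T.size(); ++j) {     // 0-based j <-> T_{j+1}
        R = ((R << 1) | 1) & D[(unsigned char)T[j]]; // LShift, Or 1, And D^{T_{j+1}}
        if (R & (1ULL << (p - 1)))
            std::cout << "match at " << j + 2 - p << '\n';
    }
}
```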

3 Our Algorithm

In this section we will show an algorithm which solves Swap Matching. We call the algorithm GSM (Graph Swap Matching). GSM uses the graph theoretic model presented in Sect. 2.2 and is based on the Shift-And algorithm from Sect. 2.4.

The basic idea of the GSM algorithm is to represent prefix match signals (see Definition 4) from the basic matching algorithm (Sect. 2.3) over \(\mathcal {P}_{P}\) in bit vectors. The GSM algorithm represents all signals I in the bitmaps \({RX}^{}_{}\) formed by three vectors, one for each row. Each time GSM processes a symbol of T, it first propagates the signal along the edges, then filters the signal and finally checks for matches. All these operations can be done very quickly thanks to bitwise parallelism.

First, we make the concept of GSM more familiar by presenting a way to interpret the Shift-And algorithm by means of the basic matching algorithm (BMA) from Sect. 2.3 to solve the (ordinary) Pattern Matching problem. Then we expand this idea to Swap Matching by using the graph theoretic model.

3.1 Graph Theoretic View of the Shift-And Algorithm

Let T and P be a text and a pattern of lengths t and p, respectively. We create the \(T\)-graph \(\mathcal {T}_{P} = (V,E,\sigma )\) of the pattern P.

Definition 5

Let S be a string. The \(T\)-graph of S is a graph \(\mathcal {T}_{S} = (V,E,\sigma )\) where \(V = \{ v_i \mid 1 \le i \le |S| \}\), \(E = \{ (v_i,v_{i+1}) \mid 1 \le i \le |S|-1 \}\) and \(\sigma : V \rightarrow \varSigma \) is such that \(\sigma (v_i) = S_i\).

We know that the \(T\)-graph is a directed acyclic graph which can be divided into columns \(V_i\), \(1 \le i \le p\) (each of them containing the single vertex \(v_i\)) such that the edges lead from \(V_j\) to \(V_{j+1}\). This means that the \(T\)-graph satisfies all assumptions of BMA. We apply BMA to \(\mathcal {T}_{P}\) to figure out whether P matches T at a position j. We get a correct result because for each \(i \in \{1,\ldots ,p\}\) we check whether \(T_{j+i-1} = \sigma (v_i) = P_i\).

To find every occurrence of P in T we would have to run BMA for each position separately. This is basically the naive approach to pattern matching. We can improve the algorithm significantly when we parallelize the computations of the p runs of BMA in the following way.

The algorithm processes one symbol at a time starting from \(T_1\). We say that the algorithm is in the \(j\text {-th}\) step when the symbol \(T_j\) has been processed. BMA represents a prefix match as a prefix match signal \(I^{} : V \rightarrow \{ 0, 1 \}\). Its value in the \(j\text {-th}\) step is denoted \(I^{j}\). Since one run of BMA uses only one column of the \(T\)-graph at any time, we can use the other vertices to represent different runs of BMA. We represent all prefix match indicators in one vector so that we can manipulate them easily. To do that we prepare a bit vector \(R{}^{}_{}\). Its value in the \(j\text {-th}\) step is denoted \(R{}^{j}_{}\) and defined as \(R{}^{j}_{i} = I^{j}(v_i)\).

The first operation used in BMA (propagate signal along the edges) can be done easily by setting the signal of \(v_i\) to the value of the signal of its predecessor \(v_{i-1}\) in the previous step. I.e., for \(i \in \{ 1, \dots , p \}\) we set \(I^{j}(v_i) = 1\) if \(i=1\) and \(I^{j}(v_i) = I^{j-1}(v_{i-1})\) otherwise. In terms of \(R^j\) this means just \(R^j=\text{ LSO }(R^{j-1})\), where LSO is defined as \(\text{ LSO }(x) = \text{ LShift }(x) \mid 1\).

We also need a way to set \(I^{j}(v_i) := 0\) for each \(v_i\) for which \(\sigma (v_i) \ne T_{j}\), which is the other basic BMA operation (filter signal by a symbol). We can do this using the bit vector \(D^{x}_{}\) from Sect. 2.4 and taking \( R{}^{}_{} \mathbin { \& }D^{x}_{}\). I.e., the algorithm computes \(R^j\) as \( R{}^{j}_{} = \text{ LSO }(R{}^{j-1}_{}) \mathbin { \& }D^{T_{j}}_{}\).

The last BMA operation we have to define is the match detection. We do this by checking whether \(R{}^{j}_{p} = 1\) and if this is the case then a match starting at position \(j-p+1\) occurred.

3.2 Our Algorithm for Swap Matching Using the Graph Theoretic Model

Now we are ready to describe the GSM algorithm.

We again let \(\mathcal {P}_{P} = (V,E,\sigma )\) be the \(P\)-graph of the pattern P, apply BMA to \(\mathcal {P}_{P}\) to figure out whether P matches T at a position j, and parallelize p runs of BMA on \(\mathcal {P}_{P}\).

Again, the algorithm processes one symbol at a time and it is in the \(j\text {-th}\) step when the symbol \(T_j\) has been processed. We again denote the value of the prefix match signal \(I^{}: V \rightarrow \{ 0, 1 \}\) of BMA in the \(j\text {-th}\) step by \(I^{j}\). I.e., the semantic meaning of \(I^{j}(m_{r,c})\) is that \(I^{j}(m_{r,c}) = 1\) if there exists a swap permutation \(\pi \) such that \(\pi (c)=c+r\) and \({\pi (P)}_{[1,c]} = {T}_{[j-c+1,j]}\), and \(I^{j}(m_{r,c}) = 0\) otherwise.

We want to represent all prefix match indicators in vectors so that we can manipulate them easily. We can do this by mapping the values of I for rows \(r \in \{-1,0,1\}\) of the \(P\)-graph to vectors \({RU}^{}_{},{RM}^{}_{}\), and \({RD}^{}_{}\), respectively. We denote the value of the vector \({RX}^{}_{} \in \{ {RU}^{}_{},{RM}^{}_{},{RD}^{}_{} \}\) in the \(j\text {-th}\) step as \({RX}^{j}_{}\). We define the values of the vectors as \({RU}^{j}_{i} = I^{j}(m_{-1,i})\), \({RM}^{j}_{i} = I^{j}(m_{0,i})\), and \({RD}^{j}_{i} = I^{j}(m_{1,i})\), where \(I^{j}(v)=0\) for every \(v\notin V\).

We define the BMA propagate signal along the edges operation as setting the signal of \(m_{r,c}\) to 1 if at least one of its predecessors has its signal set to 1. I.e., we set \(I^{j+1}(m_{-1,i}) := I^{j}(m_{1,i-1})\), \(I^{j+1}(m_{0,i}) := I^{j}(m_{-1,i-1}) \mid I^{j}(m_{0,i-1})\), \(I^{j+1}(m_{0,1}) := 1\), \(I^{j+1}(m_{1,i}) := I^{j}(m_{-1,i-1}) \mid I^{j}(m_{0,i-1})\), and \(I^{j+1}(m_{1,1}) := 1\). We can perform the above operation using the \(\text{ LSO }(R{}^{}_{})\) operation. We obtain the propagate signal along the edges operation in the form \({RU'}^{j+1}_{} := \text{ LSO }({RD}^{j}_{})\), \({RM'}^{j+1}_{} := \text{ LSO }({RM}^{j}_{} \mid {RU}^{j}_{})\), and \({RD'}^{j+1}_{} := \text{ LSO }({RM}^{j}_{} \mid {RU}^{j}_{})\).

The operation filter signal by a symbol can be done by first constructing a bit vector \(D^{x}_{}\) for each \(x \in \varSigma \) as \(D^{x}_{i}=1\) if \(x=P_i\) and \(D^{x}_{i}=0\) otherwise. Then we use these vectors to filter signal by a symbol x by taking \( {RU}^{j}_{} := {RU'}^{j}_{} \mathbin { \& }\text{ LShift }(D^{T_j})\), \( {RM}^{j}_{} := {RM'}^{j}_{} \mathbin { \& }D^{T_j}\), and \( {RD}^{j}_{} := {RD'}^{j}_{} \mathbin { \& }\text{ RShift }(D^{T_j})\).

The last operation we define is the match detection. We do this by checking whether \({RU}^{j}_{p}=1\) or \({RM}^{j}_{p}=1\) and if this is the case, then a match starting at position \(j-p+1\) occurred.

Algorithm 3. The GSM algorithm

The final GSM algorithm (Algorithm 3) first prepares the D-masks \(D^{x}_{}\) for every \(x \in \varSigma \) and initializes \({RU}^{0}_{}:={RM}^{0}_{}:={RD}^{0}_{}:=0\) (Steps 1–3). Then the algorithm computes the values of the vectors \({RU}^{j}_{}\), \({RM}^{j}_{}\), and \({RD}^{j}_{}\) for \(j \in \{ 1,\dots ,t \}\) by first using the above formula for signal propagation (Steps 5 and 6) and then the formula for signal filtering (Step 7); it then checks whether \({RU}^{j}_{p}=1\) or \({RM}^{j}_{p}=1\) and if this is the case the algorithm reports a match (Step 8).
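When \(p \le w = 64\), the whole of Algorithm 3 fits in a few lines of C++. The sketch below mirrors Steps 1–8 (the D-masks are indexed by byte value; the shifts of \(D^{T_j}\) implement the row-dependent filtering):

```cpp
#include <cstdint>
#include <string>
#include <vector>

// GSM sketch for p <= 64; returns the 1-based positions of all swap matches.
std::vector<std::size_t> gsm(const std::string& P, const std::string& T) {
    const std::size_t p = P.size();
    std::uint64_t D[256] = {};                         // Steps 1-3: D-masks
    for (std::size_t i = 0; i < p; ++i) D[(unsigned char)P[i]] |= 1ULL << i;
    auto LSO = [](std::uint64_t x) { return (x << 1) | 1ULL; };
    std::uint64_t RU = 0, RM = 0, RD = 0;
    std::vector<std::size_t> matches;
    for (std::size_t j = 0; j < T.size(); ++j) {       // Step 4
        const std::uint64_t Dx = D[(unsigned char)T[j]];
        const std::uint64_t up = LSO(RM | RU);         // Steps 5-6: propagate
        const std::uint64_t RUn = LSO(RD) & (Dx << 1); // Step 7: filter rows
        const std::uint64_t RMn = up & Dx;
        const std::uint64_t RDn = up & (Dx >> 1);
        RU = RUn; RM = RMn; RD = RDn;
        if ((RU | RM) & (1ULL << (p - 1)))             // Step 8: match check
            matches.push_back(j + 2 - p);              // 1-based start position
    }
    return matches;
}
```

For instance, gsm("acbab", "babcabc") reports the single match at position 2, in accordance with the example of Fig. 2.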

Observe that Algorithm 3 accesses the input sequentially and thus it is a streaming algorithm. We now prove correctness of our algorithm. To ease the notation let us define \(R^j(m_{r,c})\) to be \({RU}^{j}_{c}\) if \(r=-1\), \({RM}^{j}_{c}\) if \(r=0\), and \({RD}^{j}_{c}\) if \(r=1\). We define \(R'^j(m_{r,c})\) analogously. Similarly, we define \(D^x(m_{r,c})\) as \((\text{ LShift }(D^x))_c=D^x_{c-1}\) if \(r=-1\), \(D^x_c\) if \(r=0\), and \((\text{ RShift }(D^x))_c=D^x_{c+1}\) if \(r=1\). By the way the masks \(D^x\) are computed on lines 2 and 3 of Algorithm 3, we get the following observation.

Observation 1

For every \(m_{r,i} \in V\) and every \(j \in \{i, \ldots , t\}\) we have \(D^{T_j}(m_{r,i})=1\) if and only if \(T_j=P_{r+i}\).

The following lemma constitutes the crucial part of the correctness proof.

Lemma 1

For every \(m_{r,i} \in V\) and every \(j \in \{i, \ldots , t\}\) we have \(R^j(m_{r,i})=1\) if and only if there exists a swap permutation \(\pi \) such that \({\pi (P)}_{[1,i]} = {T}_{[j-i+1,j]}\) and \(\pi (i)=i+r\).

Due to space constraints, we defer the proof of this lemma to the full version of this paper, see also its ArXiv version [6].

Our GSM algorithm reports a match at position \(j-p+1\) if and only if \(R^j(m_{-1,p})=1\) or \(R^j(m_{0,p})=1\). However, by Lemma 1, this happens if and only if there is a swap match of P at position \(j-p+1\) in T. Hence, the algorithm is correct.

Theorem 2

The GSM algorithm runs in \(O({\lceil \frac{p}{w}\rceil }(|\varSigma | + t) + p)\) time and uses \(O({\lceil \frac{p}{w}\rceil }|\varSigma |)\) memory cells (not counting the input and output cells), where t is the length of the input text, p is the length of the input pattern, w is the word-size of the machine, and \(|\varSigma |\) is the size of the alphabet (Footnote 2).

Proof

The initialization of the \({RX}^{}_{}\) and \(D^{x}_{}\) masks (lines 1 and 2) takes \(O({\lceil \frac{p}{w}\rceil }|\varSigma |)\) time. The bits in the \(D^{x}_{}\) masks are set according to the pattern in O(p) time (line 3). The main cycle of the algorithm (lines 4–8) makes t iterations. Each iteration consists of computing the values of \({RX}^{}_{}\) in 13 bitwise operations, i.e., in \(O({\lceil \frac{p}{w}\rceil })\) machine operations, and checking for the result in O(1) time. This gives \(O({\lceil \frac{p}{w}\rceil }(|\varSigma | + t) + p)\) time in total. The algorithm stores 3 \({RX}^{}_{}\) masks (using the same space for all j and also for the \({RX'}^{}_{}\) masks), \(|\varSigma |\) \(D^{x}_{}\) masks, and a constant number of variables for other uses (iteration counters, a temporary variable, etc.). Thus, in total the GSM algorithm needs \(O({\lceil \frac{p}{w}\rceil }|\varSigma |)\) memory cells.    \(\square \)

Corollary 1

If \(p = c w\) for some constant c, then the GSM algorithm runs in \(O(|\varSigma | + p + t)\) time and has \(O(|\varSigma |)\) space complexity. Moreover, if \(p \le w\), then the GSM algorithm can be implemented using only \(7 + |\varSigma |\) memory cells.

Proof

The first part follows directly from Theorem 2. Let us show the second part. We need \(|\varSigma |\) cells for all D-masks, 3 cells for \({R}^{}_{}\) vectors (reusing the space also for \(R^{\prime }\) vectors), one pointer to the text, one iteration counter, one constant for the match check and one temporary variable for the computation of the more complex parts of the algorithm. Altogether, we need only \(7 + |\varSigma |\) memory cells to run the GSM algorithm.    \(\square \)

From the space complexity analysis we see that for sufficiently small alphabets (e.g., DNA sequences) the GSM algorithm can be implemented in practice using solely CPU registers, with the exception of the text, which has to be loaded from RAM.

4 Limitations of the Finite Deterministic Automata Approach

Many string matching problems can be solved by finite automata. A non-deterministic finite automaton that solves Swap Matching can be constructed by a simple modification of the \(P\)-graph. An alternative approach to solve Swap Matching would thus be to determinize and execute this automaton. The drawback is that the determinization process may lead to an exponential number of states. We show that in some cases it actually does, contradicting the conjecture of Holub [13] stating that the number of states of this determinized automaton is O(p).

Theorem 3

There is an infinite family F of patterns such that any deterministic finite automaton \(A_P\) accepting the language \(L_S(P)=\{u\pi (P) \mid u \in \varSigma ^*, \pi \text { is a swap permutation for } P\}\) for \(P \in F\) has \(2^{\varOmega (|P|)}\) states.

Table 1. An example of the construction from the proof of Theorem 3 for \(k=3\).

Proof

For any integer k we define the pattern \(P_k := ac(abc)^k\). Note that the length of \(P_k\) is \(\varTheta (k)\). Suppose for contradiction that the automaton \(A_{P_k}\) recognizing the language \(L_S(P_k)\) has \(s < 2^k\) states. We define a set of strings \(T_0, \dots , T_{2^k-1}\) where \(T_i\) is defined as follows. Let \(b^i_{k-1}, b^i_{k-2} \dots b^i_0\) be the binary representation of the number i. Let \(B^i_j=abc\) if \(b^i_j = 0\) and let \(B^i_j=bac\) if \(b^i_j = 1\). Then, let \(T_i := ac B^i_{k-1} B^i_{k-2} \dots B^i_0\). See Table 1 for an example. Since \(s < 2^k\), there exist \(0 \le i < j \le 2^k-1\) such that both \(T_i\) and \(T_j\) are accepted by the same accepting state q of the automaton \(A_{P_k}\). Let m be the minimum number such that \(b^i_{k-1-m} \ne b^j_{k-1-m}\). Note that \(b^i_{k-1-m} = 0\) and \(b^j_{k-1-m} = 1\). Now we define \(T_i'=T_i (abc)^{m+1}\) and \(T_j'=T_j (abc)^{m+1}\). Let \(X=(T_i')_{[3(m+1)+1,3(m+1+k)+2]}\) and \(Y=(T_j')_{[3(m+1)+1,3(m+1+k)+2]}\) be the suffixes of the strings \(T_i'\) and \(T_j'\), both of length \(3k+2\). Note that X begins with \(bc\dots \), Y begins with \(ac\dots \), and a block abc or bac repeats k times in both. Therefore the pattern \(P_k\) swap matches Y and does not swap match X. Since after reading the last symbol of both \(T_i\) and \(T_j\) the automaton is in the same state q, the computations on \(T_i'\) and \(T_j'\) must end in the same state \(q'\). However, as \(T_i'\) should not be accepted and \(T_j'\) should be accepted, we obtain a contradiction with the correctness of the automaton \(A_{P_k}\). Hence, we may define the family F as \(F=\{ P_1, P_2, \dots \}\), concluding the proof.    \(\square \)
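The strings used in the proof are easy to generate; the following small helper (ours, for experimenting with the construction) builds \(T_i\) for given k and i.

```cpp
#include <string>

// T_i from the proof of Theorem 3: "ac" followed by blocks B_{k-1} ... B_0,
// where block j is abc if bit j of i is 0 and bac if it is 1.
std::string buildT(int k, unsigned long i) {
    std::string s = "ac";
    for (int j = k - 1; j >= 0; --j)
        s += ((i >> j) & 1UL) ? "bac" : "abc";
    return s;
}
```

For \(k = 3\) it generates the strings of Table 1, from \(T_0 = acabcabcabc\) to \(T_7 = acbacbacbac\).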

This proof shows the necessity of specially designed algorithms for Swap Matching. We presented one in the previous section and now we return to the existing algorithms.

5 Smalgo Algorithm

In this section we discuss how Smalgo by Iliopoulos and Rahman [14] and Smalgo-I by Ahmed et al. [1] work. Since they are bitwise inverses of each other, we introduce them both in terms of the operations used in Smalgo-I.

Before we show how these algorithms work, we need one more definition.

Definition 6

A degenerate symbol w over an alphabet \(\varSigma \) is a nonempty set of symbols from the alphabet \(\varSigma \). A degenerate string S is a string built over an alphabet of degenerate symbols. We say that a degenerate string \(\widetilde{P}\) matches a text T at a position j if \(T_{j+i-1} \in \widetilde{P_i}\) for every \(1 \le i \le p\).

5.1 Smalgo-I

Smalgo-I [1] is a modification of the Shift-And algorithm from Sect. 2.4 for Swap Matching. The algorithm uses the graph theoretic model introduced in Sect. 2.2.

First let \(\widetilde{P} = \{P_{1}, P_{2} \} \dots \{P_{i-1}, P_{i}, P_{i+1} \} \dots \{ P_{p-1}, P_{p} \}\) be a degenerate version of the pattern P. The degenerate symbol of \(\widetilde{P}\) at position i represents the set of symbols of P which can swap to that position. To accommodate the Shift-And algorithm to match degenerate patterns we need to change the way the \(D^{x}_{}\) masks are defined. For each \(x \in \varSigma \) let \(\widetilde{D}^x\) be the bit array of size p such that for \(1 \le i \le p\), \(\widetilde{D}^x_i = 1\) if and only if \(x \in \widetilde{P}_i\).
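The \(\widetilde{D}^x\) masks are built like the \(D^x\) masks of Sect. 2.4, except that every symbol also marks its two neighbouring positions; a sketch under the same conventions as before (assuming \(p \le 64\)):

```cpp
#include <cstdint>
#include <string>

// Degenerate masks: bit i of Dt[x] is set iff x is in {P_{i-1}, P_i, P_{i+1}},
// i.e., iff x can swap to 1-based position i+1 (bit i <-> position i+1).
void buildDegenerateMasks(const std::string& P, std::uint64_t Dt[256]) {
    const std::size_t p = P.size();
    for (int c = 0; c < 256; ++c) Dt[c] = 0;
    for (std::size_t i = 0; i < p; ++i) {
        Dt[(unsigned char)P[i]] |= 1ULL << i;
        if (i > 0)     Dt[(unsigned char)P[i]] |= 1ULL << (i - 1); // P_i into position i-1
        if (i + 1 < p) Dt[(unsigned char)P[i]] |= 1ULL << (i + 1); // P_i into position i+1
    }
}
```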

While a match of the degenerate pattern \(\widetilde{P}\) is a necessary condition for a swap match of P, it is clearly not sufficient. The way the Smalgo algorithms try to fix this is by introducing P-masks \(P(x_1,x_2,x_3)\) defined as \(P(x_1,x_2,x_3)_i = 1\) if \(i = 1\) or if there exist vertices \(u_1,u_2\), and \(u_3\) and edges \((u_1,u_2), (u_2,u_3)\) in \(\mathcal {P}_{P}\) for which \(u_2 = m_{r,i}\) for some \(r \in \{-1,0,1\}\) and \(\sigma (u_n) = x_n\) for \(1 \le n \le 3\), and \(P(x_1,x_2,x_3)_i = 0\) otherwise. One P-mask called P(xxx) is used to represent all the P-masks for triples \((x_1,x_2,x_3)\) which only contain 1 in the first column.

Now, whenever checking whether P prefix swap matches \(k+1\) symbols of T at position j, we check for a match of \(\widetilde{P}\) in T and we also check whether \(P(T_{j+k-1},T_{j+k},T_{j+k+1})_{k+1} = 1\). This ensures that the symbols are able to swap to the respective positions and that those three symbols of the text T are present in some \(\pi (P)\).

With the P-masks completed we initialize \( R{}^{1}_{}=1 \mathbin { \& }\widetilde{D}^{T_1}\). Then for every \(j=1\) to t we repeat the following. We compute \(R{}^{j+1}_{}\) as \( R{}^{j+1}_{} = \text{ LSO }(R{}^{j}_{}) \mathbin { \& }\widetilde{D}^{T_{j+1}} \mathbin { \& }\text{ RShift }(\widetilde{D}^{T_{j+2}}) \mathbin { \& }P(T_j, T_{j+1}, T_{j+2})\). To check whether or not a swap match occurred we check whether \(R{}^{j}_{p-1} = 1\). This is claimed to be sufficient because during the processing we are in fact considering not only the next symbol \(T_{j+1}\) but also the symbol \(T_{j+2}\).

5.2 The Flaw in the Smalgo, Smalgo-I and Smalgo-II

We shall see that for a pattern \(P = abab\) and a text \(T = aaba\) all Smalgo versions give false positives.

The concept of Smalgo is based on the assumption that we can find a path in \(\mathcal {P}_{P}\) by searching for consecutive paths of length 3 (triplets), where every two consecutive triplets share two columns and can partially overlap. However, this only works if the consecutive triplets actually share the two vertices in the common columns. If the assumption is not true, then the found substring of the text might not match any swap version of P.

The above input gives such a configuration and therefore the assumption is false. The Smalgo-I algorithm actually reports match of pattern \(P = abab\) on a position 1 of text \(T = aaba\). This is obviously a false positive, as the pattern has two b symbols while the text has only one.

The reason behind the false positive match is as follows. The algorithm checks whether the first triplet of symbols (aab) matches; it can match the swapped version aabb. Next it checks the second triplet of symbols (aba), which can match baba. We know that baba is not possible since it did not appear in the previous check, but the algorithm cannot distinguish the two cases since it only checks for the existence of triplets. Since each step gave a positive answer, the algorithm reports a swap match of the pattern in the text.
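This is easy to confirm with the brute-force enumeration sketched in Sect. 2.1: no swapped version of abab equals aaba, so the reported occurrence is necessarily a false positive.

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> out;
    swappedVersions("abab", 0, "", out);  // sketch from Sect. 2.1
    for (const auto& s : out) std::cout << s << '\n';
    // prints abab, abba, aabb, baab, baba; "aaba" is not among them
}
```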

In Fig. 3 we see the two triplets which Smalgo assumes have two vertices in common. The Smalgo-II algorithm saves space by maintaining less information; however, it simulates how Smalgo-I works and so it contains the same flaw. The ArXiv version of our paper [6] provides more details on the execution of the Smalgo-I algorithm on the pattern \(P = abab\) and the text \(T = aaba\), and also a detailed analysis of the Smalgo-II algorithm.

Fig. 3. The Smalgo flaw represented in the \(P\)-graph for \(P = abab\)

6 Experiments

We implemented our Algorithm 3 (GSM), described in Sect. 3.2, the Bitwise Parallel Cross Sampling (BPCS) algorithm by Cantone and Faro [8], the Bitwise Parallel Backward Cross Sampling (BPBCS) algorithm by Campanelli et al. [7], and the faulty Smalgo algorithm by Iliopoulos and Rahman [14]. All these implementations are available online (see Footnote 1).

We tested the implementations on three real-world datasets. The first dataset (CH) is the 7th chromosome of the human genome (Footnote 3), which consists of 159 M characters from the standard ACTG nucleobases plus N for non-determined. The second dataset (HS) is a partial genome of Homo sapiens from the Protein Corpus (Footnote 4), with 3.3 M characters representing proteins encoded in 19 different symbols. The last dataset (BIB) is the Bible text of the Canterbury Corpus (Footnote 5), with 4.0 M characters containing 62 different symbols. For each of the lengths 3, 4, 5, 6, 8, 9, 10, 12, 16, and 32 we randomly selected 10,000 patterns from each text and processed each of them with each implemented algorithm.

All measured algorithms were implemented in C++ and compiled with -O3 in gcc 6.3.0. Measurements were carried out on an Intel Core i7-4700HQ processor with 2.4 GHz base frequency and 3.4 GHz turbo, with 8 GiB of DDR3 memory at 1.6 GHz. Time was measured using std::chrono::high_resolution_clock::now() from the C++ chrono library. The resulting running times, shown in Table 2, were averaged over the 10,000 patterns of the given length.

Table 2. Comparison of the running times. Each value is the average running time in milliseconds over 10,000 patterns randomly selected from the text.
Table 3. Found occurrences across the datasets. Each value is the sum of occurrences over all the patterns.

The results show that the GSM algorithm runs approximately \(23\%\) faster than Smalgo (ignoring the fact that Smalgo is faulty by design). Also, the performance of GSM and BPCS is almost indistinguishable; according to our experiments, it varies within a few percent depending on the exact CPU, cache, RAM and compiler setting. The seemingly superior average performance of BPBCS is caused by the heuristics BPBCS uses; however, while the worst-case performance of GSM is guaranteed, the performance of BPBCS for certain patterns is worse than that of GSM. Also note that GSM is a streaming algorithm while the others are not.

Table 3 quantifies the inaccuracy of Smalgo-I caused by its flaw by comparing the number of occurrences found by the respective algorithms. The ratio of false positives to true positives for Smalgo-I was \(2.17\%\) on CH, \(0.20\%\) on HS and \(0.002\%\) on BIB.