
1 Introduction

Sequence data mining is a field of research that is bound to become central in the coming years due to the growth of the so-called "Big Data" challenge. Moreover, available data in many application fields consist of sequences (for example over time or space) of generic objects. Generally speaking, given a set of sequences defined over a particular domain, a data mining problem consists in searching for possible frequent subsequences (patterns), relying on inexact matching procedures. In this work we propose a solution for the so-called approximate subsequence mining problem, in which we admit some noise in the matching process. For instance, in computational biology, searching for recurrent patterns is a critical task in the study of DNA, aiming to identify genetic mutations or to classify proteins according to structural properties. Sometimes the process of pattern extraction returns sequences that differ from one another in a few positions. Consequently, the choice of an adequate dissimilarity measure becomes a critical issue when designing an algorithm able to deal with this kind of problem. Handling sequences of objects is another challenging aspect, especially when the data mining task is defined over a structured domain of sequences [1, 2]. Thinking of data mining algorithms as building blocks of a wider system facing a classification task, a reasonable way to treat complex sequential data is to map sequences to \(\mathbb {R}^d\) vectors by means of some feature extraction procedure, in order to use classification techniques that accept real-valued vectors as input data [3–7]. The Granular Computing (GrC) approach [8] offers a valuable framework to fill the gap between the input sequence domain and the feature space \(\mathbb {R}^d\); it relies on the so-called information granules, which play the role of indistinguishable features at the particular level of abstraction adopted for system description. The main objective of granular modeling consists in finding the correct level of information granulation that best describes the input data [9].

2 Frequent Substructures Mining and Matching Problem

The problem of sequential pattern mining was first introduced by Agrawal and Srikant [10] in a specific context: starting from a dataset of sequences of customer transactions, the objective consists in mining for sequential patterns in such a dataset. In a dataset of sequences of customer transactions, the generic object \(\alpha _i\) of each sequence consists of the following fields: customer-id, transaction-time and the set of items purchased in the transaction. Agrawal et al. [10] introduce for the first time the notion of itemset as a non-empty set of items. This problem is often viewed as the discovery of "association rules", which is strictly dependent on the task of mining frequent itemsets. In [11], the authors propose the very first algorithm able to generate significant association rules between items in databases. Managers of supermarkets as well as of e-commerce websites have to make decisions about which products to put on sale, how to design coupons and how to customize offers in order to maximize their profits. This problem raises the need to analyze past transactions and predict future behaviors.

All the studies in this field are based on the notion of the market-basket model of data [12]. It is used to describe relationships between items and baskets, also called "transactions". Each basket consists of an itemset, and it is assumed that the number of items in a basket is much smaller than the total number of items.

The a priori principle underlying this model (hence the name "a priori-like" for the derived methods) asserts that an itemset cannot be frequent if its items are not frequent or, equivalently, that any super-pattern of an infrequent pattern cannot be frequent. Using this principle, Agrawal and Srikant proposed the AprioriAll algorithm in [10]. Their approach aims to extract frequent sequential patterns and is based on a candidate generation-and-test paradigm. Note that during the mining procedure, candidate frequent sequential patterns can be obtained only by joining shorter frequent sequential patterns. An example of a sequential pattern is "5 % of customers bought {Apple, Orange, Flour, Coffee} in one transaction, followed by {Coffee, Sugar} in a later transaction". The weakness of the algorithm is that a huge set of candidate sequences is generated, requiring an enormous amount of memory and many repeated database scans. This behavior gets worse with increasing size of the sequences in the database.
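To make the candidate generation-and-test paradigm concrete, the following minimal Python sketch mines frequent itemsets with a priori pruning. It is only an illustration of the principle, not the AprioriAll algorithm itself (which operates on sequences of transactions); the function names and the toy data are ours.

```python
from itertools import combinations

def apriori(baskets, min_support):
    """Minimal a priori-style frequent itemset miner.

    baskets: list of sets of items; min_support: minimum number of
    baskets an itemset must appear in to be considered frequent.
    """
    # Level 1: frequent single items.
    items = {i for b in baskets for i in b}
    freq = {frozenset([i]) for i in items
            if sum(i in b for b in baskets) >= min_support}
    all_freq, k = set(freq), 2
    while freq:
        # Candidate generation: join frequent (k-1)-itemsets...
        candidates = {a | b for a in freq for b in freq if len(a | b) == k}
        # ...and prune candidates having any infrequent (k-1)-subset
        # (the a priori / anti-monotonicity principle).
        candidates = {c for c in candidates
                      if all(frozenset(s) in freq
                             for s in combinations(c, k - 1))}
        # Test step: note the one full database scan per level,
        # which is the weakness discussed in the text.
        freq = {c for c in candidates
                if sum(c <= b for b in baskets) >= min_support}
        all_freq |= freq
        k += 1
    return all_freq

baskets = [{"apple", "orange", "flour", "coffee"},
           {"coffee", "sugar"},
           {"apple", "coffee", "sugar"}]
print(apriori(baskets, min_support=2))
```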

In [13] a new algorithm named GSP (Generalized Sequential Patterns) is introduced. The authors propose a breadth-first, bottom-up method to obtain the frequent sequential patterns. Moreover, they introduce a time constraint that fixes the minimum and maximum delay between adjacent elements of the candidate patterns, as well as the possibility for items to be present in a set of transactions within a fixed time window. GSP outperforms the AprioriAll algorithm [10] by reducing the number of candidate sequential patterns. However, all a priori-like sequential pattern mining methods tend to behave badly on large datasets, because they may generate a large set of candidate subsequences. Moreover, such algorithms need multiple scans of the database, one for each length of the candidate patterns, and this becomes very time consuming when mining long patterns. Finally, another problem occurs with long sequential patterns: a combinatorial number of subsequences must be generated and tested.

To overcome these problems, a new algorithm named SPADE is introduced in [14]. The authors use an approach similar to GSP, but they adopt a vertical data format and divide the mining problem into smaller sub-problems, significantly reducing the number of database scans required. In [15, 16] the authors introduce two algorithms, FreeSpan and PrefixSpan. They are based on a completely different approach than AprioriAll and GSP: the pattern-growth approach for mining sequential patterns in large datasets. Each time new sequential patterns are generated, the whole dataset of sequences is projected into a set of smaller projected datasets using the extracted sequential patterns, and longer sequential patterns are grown in each projected dataset by analyzing only locally frequent fragments. PrefixSpan introduces new techniques to reduce the size of the projected datasets.

All the works presented above describe search techniques for mining non-contiguous sequences of objects. However, these approaches are not ideal when the objective is to extract frequent sequential patterns in which the contiguity of the component objects plays a fundamental role in the information extraction.

In particular, in computational biology, even though techniques for mining sequential non-contiguous patterns have many uses, they are not appropriate for many applications. The computational biology community has developed many methods for detecting frequent patterns, which in this field are called motifs. Moreover, when working with real-world data, the presence of some noise must be taken into account in the design of the matching procedure [17–20]. In many fields, and particularly in a biological context, patterns should have long lengths and high supports, but standard sequential pattern mining approaches tend to discover a large number of "low quality" patterns, i.e. patterns having either short lengths or low supports. It is easy to observe that genome sequences contain errors, so it is unlikely that long subsequences generated from the same origin will be exactly identical. Moreover, increasing the minimum number of occurrences of a subsequence in a database, in the case of exact matching, forces the acceptance of shorter and shorter subsequences, with the possibility of obtaining a massive quantity of data with a less specific meaning. In such cases, exact matching techniques can yield only short and trivial patterns. By allowing some mismatches, instead, it is possible to discover valuable sequential patterns, with longer lengths and higher approximate supports. Some works [17, 18] use Hamming distances to search for recurrent motifs in data. Other works employ a suffix tree data structure [21] or a suffix array [22] to store and organize the search space, or use a GrC framework for the extraction of frequent patterns in data [23].

The algorithm presented in [24] uses a suffix-tree data structure to mine frequent approximate contiguous subsequences (also called substrings). The procedure follows a "break-down-and-build-up" strategy. The "break-down" step searches, by means of a suffix-tree based algorithm, for the longest subsequences that repeat, with an exact match, in the whole database. These subsequences represent the initial sequences (called strands), which are iteratively assembled into longer strands by a local search algorithm. The "build-up" step groups the obtained strands, forming the set from which all approximate subsequences will be identified.

The algorithm in [25] uses an approach similar to [24], but takes into account the quality of the sequential patterns. Good quality patterns can be obtained by balancing pattern length and pattern support. Short patterns are undesirable, particularly when sequences are long, since their meaning is less specific. Patterns with low supports are not desirable either, since they can be trivial and may not describe general phenomena. Thus, the algorithm is biased toward the search for longer subsequences characterized by a sufficient frequency. It makes use of a suffix array to store and organize the search space (i.e., the set of subsequences) in lexicographic order. The search on such a suffix array follows a prefix extension approach, meaning that frequent subsequences are identified by analyzing the prefixes of the input sequences, tolerating inexactness during the evaluation.

The computational biology community has developed many algorithms for mining frequent motifs using the Hamming distance as similarity measure. YMF [17] is based on the computation of the statistical significance of each motif, but its performance degrades as the complexity of the motifs increases. Weeder [18] is a suffix-tree-based algorithm and is faster than YMF because it considers only certain types of mismatches for the motifs; however, it cannot be used for different types of motifs. Another algorithm, MITRA [19], is a mismatch-tree-based approach and uses heuristics to prune the space of possible motifs.

The analysis and interpretation of time series is another challenging problem that many authors have tried to solve [26, 27]. Some works consider the problem of mining for motifs in time series databases in several applications: from the analysis of stock prices to the study of ECGs in medicine, to the analysis of sensor measurements. In particular, [28] shows how to discretize a time series in order to obtain a sequence of symbols defined over a fixed alphabet, to which well-known motif mining algorithms can be applied. However, a lot of information is lost in the discretization process. Moreover, this algorithm uses exact matching procedures for mining patterns and is unusable in real cases with noisy data. Chiu et al. [29] present another algorithm, based on [20], that considers the presence of noise in data. However, also in this case, only a simple model of mismatches is considered. In [30] the algorithm FLAME is presented. It is a suffix-tree-based technique and can also be used with time series datasets, by discretizing the numeric data into a sequence of symbols. All these approaches suffer from the loss of information during the discretization procedure.

In the following, we present a clustering-based subsequence mining algorithm that can be used with general sequence databases, choosing a suitable similarity measure depending on the particular application. Most existing methods focus only on the recurrence of patterns in data without taking into account the concept of "information redundancy", or, in other words, the existence of overlaps among retrieved patterns [31]. Frequent pattern mining with approximate matching is a challenging problem starting from its very definition: even if one ignores small redundant patterns, there might be a huge number of large frequent redundant patterns. This problem should be taken into consideration, so that only some representatives of such patterns survive the mining process.

3 The Proposed Algorithm

In this work we present a new approximate subsequence mining algorithm called FRL-GRADIS (Filtered Reinforcement Learning-based GRanular Approach for DIscrete Sequences) [32] aiming to reduce the information redundancy of RL-GRADIS [33] by executing an optimization-based refinement process on the extracted patterns. In particular, this paper introduces the following contributions:

  1. our approach finds the patterns that maximize the knowledge about the process that generates the sequences;

  2. we employ a dissimilarity measure that can extract patterns despite the presence of noise and possible corruptions of the patterns themselves;

  3. our method can be applied to any kind of sequence of objects, given a properly defined similarity or dissimilarity function over the objects domain;

  4. the filtering operation produces results that can be interpreted more easily by experts in the application field;

  5. considering this procedure as an inner module of a more complex classification system, it allows a further reduction of the dimension of the feature space, thus better addressing the curse of dimensionality problem.

This paper consists of three parts. In the first part we provide some useful definitions and a proper notation; in the second part we present FRL-GRADIS as a two-step procedure, consisting of a subsequence extraction step and a subsequence filtering step. Finally, in the third part, we report the results obtained by applying the algorithm to synthetic data, showing a good overall performance in most cases.

4 Problem Definition

Let \(\mathcal {D} = \{\alpha _i \}\) be a domain of objects \(\alpha _i\). The objects represent the atomic units of information. A sequence S is an ordered list of n objects that can be represented by the set of pairs

$$ S = \{ (i \rightarrow \beta _i) \; | \; i = 1,\ldots , n ; \; \beta _i \in \mathcal {D} \}, $$

where the integer i is the order index of the object \(\beta _i\) within the sequence S. S can also be expressed with the compact notation

$$ S \equiv \langle \beta _1, \beta _2 , \ldots , \beta _{n} \rangle $$

A sequence database SDB is a set of sequences \(S_i\) of variable lengths \(n_i\). For example, the DNA sequence \(S = \langle G,T,C,A,A,T,G,T,C \rangle \) is defined over the domain of the four nucleotides \(\mathcal {D} = \{ A, C, G, T \}\).

A sequence \(S_1 = \langle \beta '_1, \beta '_2 , \ldots , \beta '_{n_1} \rangle \) is a subsequence of a sequence \(S_2 = \langle \beta ''_1, \beta ''_2 , \ldots , \beta ''_{n_2} \rangle \) if \(n_1 \le n_2\) and \(S_1 \subseteq S_2 \). The position \(\pi _{S_2}(S_1)\) of the subsequence \(S_1\) with respect to the sequence \(S_2\) corresponds to the order index of its first element (in this case the order index of the object \(\beta '_1\)) within the sequence \(S_2\). The subsequence \(S_1\) is also said to be connected if

$$\beta '_j = \beta ''_{j+k} \; \; \; \forall j = 1, \ldots , n_1$$

where \(k = \pi _{S_2}(S_1)\). Two subsequences \(S_1\) and \(S_2\) of a sequence S are overlapping if

$$ S_1 \cap S_2 \ne \emptyset . $$

In the example described above, the complete notation for the sequence \(S = \langle G,T,C,A,A,T,G,T,C \rangle \) is

$$ S = \{ (1 \rightarrow G), (2 \rightarrow T), (3 \rightarrow C),\ldots \} $$

and a possible connected subsequence \(S_1 = \langle A, T, G \rangle \) corresponds to the set

$$ S_1 = \{ (5 \rightarrow A), (6 \rightarrow T), (7 \rightarrow G) \}. $$

Notice that the objects of the subsequence \(S_1\) inherit the order indices from the containing sequence S, so that they refer unambiguously to their original positions in S. From now on we will focus only on connected subsequences, so the connection property will be implicitly assumed.

4.1 Pattern Coverage

The objective of this algorithm is to find a set of frequent subsequences of objects called patterns. A pattern \(\varOmega \) is a subsequence of objects \(\langle \omega _{1}, \omega _{2} ,\ldots , \omega _{| \varOmega |} \rangle \), with \(\omega _i \in \mathcal {D}\), that is more likely to occur within the dataset SDB. Patterns are unknown a priori and represent the underlying information of the dataset records. Moreover, each sequence is subject to noise, whose effects include the addition, substitution and deletion of objects in a random, uncorrelated fashion; this makes the recognition of recurrent subsequences more challenging.

Fig. 1

Coverage of the pattern \(\varOmega \) over the subsequence \(C \subseteq S\) with tolerance \(\delta \). Black boxes and gray boxes represent respectively the covered and the uncovered objects of the sequence S. Notice that if \(\delta > 0\) the sequences \(\varOmega \) and C need not be of the same length

Given a sequence \(S \in SDB\) and a set of patterns \(\varGamma = \{ \varOmega _1,\ldots , \varOmega _m \}\), we want to determine a quality criterion for the description of S in terms of the pattern set \(\varGamma \). A connected subsequence \(C \subseteq S\) is said to be covered by a pattern \(\varOmega \in \varGamma \) iff \(d(C,\varOmega ) \le \delta \), where \(d(\cdot ,\cdot )\) is a properly defined distance function and \(\delta \) is a fixed tolerance (Fig. 1). The coverage \(\mathcal {C}_\varOmega ^{(\delta )}(S)\) of the pattern \(\varOmega \) over the sequence S is the union set of all non-overlapping connected subsequences covered by the pattern. We can write,

$$\begin{aligned} \mathcal {C}_\varOmega ^{(\delta )}(S) = \bigcup \limits _i \Big \{ C_i \subseteq S \;\; \text {s.t.}\;\; d(C_i,\varOmega ) \le \delta \; \wedge \; C_i \cap C_j = \emptyset \;\; \forall \, i \ne j \Big \}. \end{aligned}$$
(1)

Formally, this set is still not well defined until we expand on the meaning of the property

$$\begin{aligned} C_i \cap C_j = \emptyset , \end{aligned}$$
(2)

which is the requirement that the covered subsequences be non-overlapping. Indeed, we need to include additional rules on how to deal with these overlaps when they occur. To understand this better, let us recall the example of the DNA sequences presented above, where the dissimilarity measure between two sequences is the Levenshtein distance. The set of all subsequences \(C_i\) covered by the pattern \(\varOmega \) over the sequence S (in this context referred to as candidates) will consist only of sequences with lengths between \(|\varOmega | - \delta \) and \(|\varOmega | + \delta \). Indeed, these bounds correspond respectively to the extreme cases of deleting and adding \(\delta \) objects to the subsequence. In the case of two overlapping candidates \(C_i\) and \(C_j\), in order to satisfy property (2) of the coverage \(\mathcal {C}_\varOmega ^{(\delta )}(S)\), we have to define a rule deciding which subsequence belongs to the set \(\mathcal {C}_\varOmega ^{(\delta )}(S)\) and which does not. Candidates with smaller distances from the searched pattern \(\varOmega \) are chosen over overlapping candidates with larger distances. If two overlapping candidates have the same distance, the leftmost one is chosen; if their starting positions also coincide, the shorter one (i.e. the one with smaller length) has precedence.
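To make the construction of \(\mathcal {C}_\varOmega ^{(\delta )}(S)\) and the tie-breaking rules concrete, the following Python sketch computes the coverage of a single pattern over a string of characters under the Levenshtein distance. It is only an illustration of the definitions above: the function names and the greedy resolution of overlaps (scanning candidates in order of distance, then start position, then length) are our own choices, not the authors' implementation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def coverage(sequence, pattern, delta):
    """Greedy construction of the coverage of `pattern` over `sequence`.

    Candidates are windows whose length differs from |pattern| by at
    most delta; overlaps are resolved by smallest distance, then
    leftmost start, then shortest length, as described in the text.
    """
    n, m = len(sequence), len(pattern)
    candidates = []
    for start in range(n):
        for length in range(max(1, m - delta), m + delta + 1):
            window = sequence[start:start + length]
            if len(window) < length:
                break                      # window runs past the end
            if levenshtein(window, pattern) <= delta:
                candidates.append((levenshtein(window, pattern),
                                   start, length))
    covered, taken = set(), []
    # Sorting by (distance, start, length) encodes the tie-break rules.
    for d, start, length in sorted(candidates):
        span = set(range(start, start + length))
        if span.isdisjoint(covered):       # enforce Eq. (2): no overlaps
            covered |= span
            taken.append(sequence[start:start + length])
    return taken

print(coverage("CATGTCAGGTT", "AGGT", delta=1))
```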

Fig. 2

Coverage examples in the case of DNA sequences. The searched pattern \(\langle A, G, G, T \rangle \) is found 3 times with tolerance \(\delta \le 1\) using the Levenshtein distance. The three occurrences show all the edit operations allowed by the considered edit distance, respectively object substitution, deletion and insertion

A coverage example in the context of the DNA sequences is shown in Fig. 2. The coverage of the pattern \(\varOmega = \langle A, G, G, T \rangle \) over the sequence S is \(\mathcal {C}_\varOmega ^{(\delta )}(S) = \langle A,C,G,T\rangle \cup \, \langle G, G, T \rangle \cup \, \langle A, C, G, G, T \rangle \).

Similarly, the compound coverage of the pattern set \(\varGamma \) is defined as

$$\begin{aligned} \mathcal {C}_\varGamma ^{(\delta )}(S) = \bigcup _{\varOmega \in \varGamma } \mathcal {C}_\varOmega ^{(\delta )}(S). \end{aligned}$$
(3)

It is important to notice that, in this case, this set can include overlapping subsequences, but only if they belong to coverages of different patterns (i.e. it is assumed that different patterns can overlap). For example, consider the case shown in Fig. 3. The coverage \(\mathcal {C}_{\{ \varOmega _1, \varOmega _2 \}}^{(\delta )}(S)\) for the patterns \(\varOmega _1 = \langle A, G, G, T \rangle \) and \(\varOmega _2 = \langle G,T,C \rangle \) is equal to \(\mathcal {C}_{\{ \varOmega _1, \varOmega _2 \}}^{(\delta )}(S) = \langle A, G, G, T, C \rangle \).

Fig. 3

Example of the compound coverage of multiple symbols, where the symbols \(\langle G, T, C \rangle \) and \(\langle A, G, G, T \rangle \) have Levenshtein distances from the corresponding subsequences equal to 0 and 1, respectively. Notice that different symbols can cover overlapping subsequences, while competing coverages of the same symbol are not allowed and only the most similar subsequence is chosen

5 The Mining Algorithm

In this section we describe FRL-GRADIS, a clustering-based sequence mining algorithm. It is able to discover clusters of connected subsequences of variable lengths that are frequent in a sequence dataset, using an inexact matching procedure. FRL-GRADIS consists of two main steps:

  • the symbols alphabet extraction step, which addresses the problem of finding the most frequent subsequences within an SDB. It is performed by means of the clustering algorithm RL-GRADIS [33], which identifies frequent subsequences as representatives of dense clusters of similar subsequences. These representatives are referred to as symbols and the pattern set as the alphabet. The clustering procedure relies on a properly defined edit distance between the subsequences (e.g. Levenshtein distance, DTW, etc.). However, this approach alone has the drawback of extracting many superfluous symbols, which generally dilute the alphabet and deteriorate its interpretability.

  • the alphabet filtering step, which deals with the problem stated above. The objective is to filter out all the spurious or redundant symbols contained in the alphabet produced by the extraction step. To accomplish this goal we employ a heuristic approach based on evolutionary optimization over a validation SDB.

One of the distinctive features of this algorithm is its generality with respect to the kind of data contained in the input sequence database (e.g., sequences of real numbers or characters as well as sequences of complex data structures). Indeed, both steps outlined above take advantage of a dissimilarity-based approach, where the dissimilarity function can be an arbitrarily complex measure between two ordered sequences, not necessarily metric.

In the following, we first describe the main aspects of the symbols alphabet extraction procedure, then we present the new filtering method. For more details on the symbols alphabet construction we refer the reader to [33].

5.1 Frequent Subsequences Identification

Consider the input training dataset of sequences \(\mathcal {T}=\{S_1,S_2, \ldots , S_{|\mathcal {T}|} \}\) and a properly defined dissimilarity measure \(d: \mathcal {T}\times \mathcal {T}\rightarrow \mathbb {R}\) between two objects of the training dataset (e.g., the Levenshtein distance for strings of characters). The goal of the subsequence extraction step is the identification of a finite set of symbols \(\mathcal {A}_e = \{\varOmega _1,\varOmega _2, \ldots , \varOmega _{|\mathcal {A}_e|} \}\), computed using the distance \(d(\cdot , \cdot )\) in a free clustering procedure. The algorithm we chose to accomplish this task is RL-GRADIS, which is based on the well-known Basic Sequential Algorithmic Scheme (BSAS) clustering algorithm [33]. Symbols are found by analyzing a suited set of variable-length subsequences of \(\mathcal {T}\), also called n-grams, generated by expanding each input sequence \(S \in \mathcal {T}\). The expansion is done by listing all n-grams with lengths between the values \(l_{\text {min}}\) and \(l_{\text {max}}\). The parameters \(l_{\text {min}}\) and \(l_{\text {max}}\) are user-defined and are respectively the minimum and maximum admissible length for the mined patterns. The extracted n-grams are then collected into the SDB \(\mathcal {N}\). At this point, the clustering procedure is executed on \(\mathcal {N}\). For each cluster we compute its representative, defined by the Minimum Sum of Distances (MinSOD) technique [33, 34] as the element having the minimum total distance from the other elements of the cluster. This technique makes it possible to represent each cluster by means of its most characteristic element.
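A minimal sketch of the expansion step, which generates the n-gram stream \(\mathcal {N}\) from an input sequence, is given below (names are ours):

```python
def expand_ngrams(sequence, l_min, l_max):
    """List all connected subsequences (n-grams) of `sequence`
    with lengths between l_min and l_max inclusive."""
    return [sequence[i:i + l]
            for l in range(l_min, l_max + 1)
            for i in range(len(sequence) - l + 1)]

# Example: the stream N fed to the clustering procedure.
print(expand_ngrams("GTCAATGTC", 2, 3))
```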

The quality of each cluster is measured by its firing strength \(f \in [0, 1]\). Firing strengths are used to track the dynamics describing the update rate of the clusters as the input stream of subsequences \(\mathcal {N}\) is analyzed. A reinforcement learning procedure is used to dynamically update the list of candidate symbols based on their firing strengths. Clusters with a low update rate (low firing strength) are discarded in an on-line fashion during the processing of the input data stream \(\mathcal {N}\). RL-GRADIS maintains a dynamic list of candidate symbols, named receptors, which are the representatives of the active clusters. Each receptor's firing strength (i.e. the firing strength of its corresponding cluster) is dynamically updated by means of two additional parameters, \(\alpha , \beta \in [0, 1]\). The \(\alpha \) parameter is used as a reinforcement weight factor each time a cluster \(\mathcal {R}\) is updated, i.e., each time a new input subsequence is added to \(\mathcal {R}\). The firing strength update rule is defined as follows:

$$\begin{aligned} f(\mathcal {R}) \leftarrow f(\mathcal {R}) + \alpha (1 - f(\mathcal {R})). \end{aligned}$$
(4)

The \(\beta \) parameter, instead, is used to model the speed of forgetfulness of receptors according to the following formula:

$$\begin{aligned} f(\mathcal {R}) \leftarrow (1 - \beta )f(\mathcal {R}). \end{aligned}$$
(5)

The firing strength update rules shown in Eqs. (4) and (5) are applied to each currently identified receptor after the analysis of each input subsequence. Therefore, receptors/clusters that are not updated frequently during the analysis of \(\mathcal {N}\) will likely have a low firing strength, causing the system to remove them from the list.
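The receptor dynamics of Eqs. (4) and (5) can be sketched as follows. The pruning threshold f_min, the assignment of each n-gram to its nearest receptor and all names are illustrative assumptions, since the exact bookkeeping is internal to RL-GRADIS [33].

```python
def process_stream(ngrams, receptors, dist, delta,
                   alpha=0.1, beta=0.02, f_min=0.05):
    """Toy sketch of the receptor dynamics of Eqs. (4) and (5).

    receptors: dict mapping representative subsequence -> firing
    strength f. Each incoming n-gram reinforces its nearest receptor
    (if within delta); afterwards every receptor decays, and receptors
    whose strength falls below the illustrative threshold f_min are
    pruned.
    """
    for g in ngrams:
        if receptors:
            nearest = min(receptors, key=lambda r: dist(g, r))
            if dist(g, nearest) <= delta:
                # Eq. (4): reinforcement of the updated cluster.
                receptors[nearest] += alpha * (1.0 - receptors[nearest])
        # Eq. (5): forgetting, applied after each input subsequence.
        for r in receptors:
            receptors[r] *= (1.0 - beta)
        # Prune receptors with low firing strength.
        for r in [r for r, f in receptors.items() if f < f_min]:
            del receptors[r]
    return receptors
```

At the end of the stream, the surviving receptors form the extracted alphabet \(\mathcal {A}_e\).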

5.2 Subsequences Filtering

As introduced above, the output alphabet \(\mathcal {A}_e\) of the clustering procedure is generally redundant and includes many spurious symbols that make the recognition of the true alphabet quite difficult.

To deal with this problem, an optimization step is performed to reduce the alphabet size, aiming at retaining only the most significant symbols, i.e. those that best resemble the original, unknown ones. Since this procedure works like a filter, we call the output of this optimization the filtered alphabet \(\mathcal {A}_f\); clearly, \(\mathcal {A}_f \subset \mathcal {A}_e\) holds. Nevertheless, it is important for the filtered alphabet's size not to be smaller than the size of the true alphabet, since in that case useful information would be lost. Let \(\varGamma \subset \mathcal {A}_e\) be a candidate subset of symbols of the alphabet \(\mathcal {A}_e\) and \(S \in \mathcal {V}\) a sequence of a validation SDB \(\mathcal {V}\). We assume the descriptive power of the symbols set \(\varGamma \) with respect to the sequence S to be proportional to the quantity \(|\mathcal {C}^{(\delta )}_\varGamma (S)|\) (cf. Eq. 3), i.e. the number of objects \(\beta _i \in S\) covered by the symbols set \(\varGamma \). Intuitively, a smaller number of objects in the whole SDB left uncovered by the symbols of \(\varGamma \) can be considered a clue that \(\varGamma \) itself likely contains the true alphabet. The normalized number of objects in a sequence S left uncovered by a pattern set \(\varGamma \) corresponds to the quantity

$$\begin{aligned} P = \frac{|S| - |\mathcal {C}^{(\delta )}_\varGamma (S)|}{|S|}, \end{aligned}$$
(6)

where the operator \(| \cdot |\) denotes the cardinality of a set. The term P assumes the value 0 when the sequence S is completely covered by the pattern set \(\varGamma \) and the value 1 when none of the symbols in \(\varGamma \) occur in the sequence S. Notice that \(\mathcal {C}^{(\delta )}_\varGamma (S)\) depends on the parameter \(\delta \), which represents the tolerance of the system towards the corruption of symbol occurrences caused by noise.

On the other hand, a bigger pattern set is more likely to contain spurious patterns, which tend to hinder the interpretability of the obtained results, so smaller set sizes are to be preferred. This property can be described by the normalized alphabet size

$$\begin{aligned} Q = \frac{| \varGamma |}{| \mathcal {A}_e |}, \end{aligned}$$
(7)

where \(\mathcal {A}_e\) is the alphabet of symbols extracted by the clustering procedure described in the last section. Clearly, the cardinality of \(\mathcal {A}_e\) represents an upper bound for the size of the filtered alphabet, so the term Q ranges from 0 to 1. The terms P and Q generally show opposite trends, since a bigger set of symbols is more likely to cover a bigger portion of the sequence and vice versa.

Finding a tradeoff between these two quantities corresponds to minimizing the objective function, defined as the convex combination

$$\begin{aligned} G^{(\delta )}_S(\varGamma ) = \lambda \, Q + (1-\lambda ) \, P \end{aligned}$$
(8)

where \(\lambda \in [ 0, 1 ]\) is a meta-parameter that weighs the relative importance of the two contributions. It is easy to verify that

$$\begin{aligned} 0 \le G^{(\delta )}_S(\varGamma ) \le 1. \end{aligned}$$
(9)

More generally, for a validation SDB \(\mathcal {V}\), the global objective function is the mean value of \(G^{(\delta )}_S(\varGamma )\) over all sequences \(S_i \in \mathcal {V}\), hence

$$\begin{aligned} G^{(\delta )}_\mathcal {V}(\varGamma ) = \frac{ \displaystyle \sum \limits _{1 \le i \le |\mathcal {V}|} G^{(\delta )}_{S_i}(\varGamma )}{|\mathcal {V}|} \end{aligned}$$
(10)

and the best symbols set after the optimization procedure is

$$\begin{aligned} \mathcal {A}_f = \mathop {{\mathrm {argmin}}}_{\varGamma \subset \mathcal {A}_e} G^{(\delta )}_\mathcal {V}(\varGamma ). \end{aligned}$$
(11)

To solve the optimization problem described by Eq. (11) we employ a standard genetic algorithm, where each individual of the population is a subset \(\varGamma \) of the extracted alphabet \(\mathcal {A}_e = \{ \varOmega _1,\ldots , \varOmega _{|\mathcal {A}_e|} \}\). The genetic code of the individual is encoded as a binary sequence E of length \(|\mathcal {A}_e|\) of the form

$$\begin{aligned} E_\varGamma = \langle e_1, e_2, \ldots , e_{|\mathcal {A}_e|} \rangle \end{aligned}$$
(12)

with

$$ e_i = \left\{ \begin{array}{l} 1 \;\;\;\; \text {iff} \;\; \varOmega _i \in \varGamma \\ 0 \;\;\;\; \text {otherwise} \end{array}. \right. $$

It is important not to confuse the genetic codes with the SDB sequences described earlier, even though both are formally defined as ordered sequences.

Given a validation dataset \(\mathcal {V}\) and a fixed tolerance \(\delta \), the fitness value \(F(E_\varGamma )\) of each individual \(E_\varGamma \) is computed as the following affine transformation of the objective function introduced above:

$$\begin{aligned} F(E_\varGamma ) = 1 - G^{(\delta )}_\mathcal {V}(\varGamma ) \end{aligned}$$
(13)

The evolution then proceeds with standard crossover and mutation operators on the binary sequences; the stop condition is met when the maximum fitness does not change for a fixed number \(N_\text {stall}\) of generations or after a given maximum number \(N_\text {max}\) of iterations. When the evolution stops, the filtered alphabet \(\mathcal {A}_f = \widetilde{\varGamma }\) is returned, where \(\widetilde{\varGamma }\) is the symbols subset corresponding to the fittest individual \(E_{\widetilde{\varGamma }}\).
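A sketch of the fitness evaluation of Eq. (13) over the binary encoding of Eq. (12) is given below. The helper coverage_size, assumed to return \(|\mathcal {C}^{(\delta )}_\varGamma (S)|\) (it could be built on the coverage sketch of Sect. 4.1), and all parameter names are ours; the crossover and mutation operators are the standard ones of any GA library.

```python
def fitness(code, alphabet, validation, coverage_size, lam, delta):
    """Fitness of Eq. (13): F = 1 - G, with G averaged over the
    validation SDB as in Eq. (10).

    code: binary tuple (e_1, ..., e_|A_e|) selecting the subset Gamma;
    coverage_size(S, gamma, delta) is assumed to return the number of
    objects of S covered by the compound coverage of gamma (Eq. 3).
    """
    gamma = [sym for sym, bit in zip(alphabet, code) if bit]
    if not gamma:
        return 0.0  # convention: penalize the empty subset
    q = len(gamma) / len(alphabet)               # Eq. (7): normalized size
    total = 0.0
    for seq in validation:
        covered = coverage_size(seq, gamma, delta)
        p = (len(seq) - covered) / len(seq)      # Eq. (6): uncovered fraction
        total += lam * q + (1.0 - lam) * p       # Eq. (8)
    return 1.0 - total / len(validation)         # Eqs. (10) and (13)
```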

6 Tests and Results

In this section we present the results of different experiments designed to test the effectiveness and performance of FRL-GRADIS on problems of varying complexity.

6.1 Data Generation

We tested the capabilities of FRL-GRADIS on synthetic sequence databases composed of textual strings. Accordingly, the domain of the problem is the English alphabet

$$ \mathcal {D} = \{A, B, C, \ldots , Z\}. $$

In all cases, the modeled noise consists of random character insertions, deletions and substitutions applied to the original string. For this reason, a natural choice of dissimilarity measure between sequences is the Levenshtein distance, which measures the minimum number of edit steps necessary to transform one string of characters into another. We conducted two different classes of tests, which account for two kinds of noise, respectively symbols noise and channel noise, presented in the following paragraphs.

6.1.1 Symbols Noise

This kind of noise simulates those situations in which symbols are altered during the composition of the sequence. Each instance of a symbol being added to the data sequence has a fixed probability of being mutated by one addition, deletion or modification of its objects. Moreover, a variable number of uncorrelated objects are added between contiguous instances of symbols in the sequence, to simulate the presence of irrelevant data separating actual symbols. The detailed process of data generation is described below; a code sketch follows the list:

  1. the true symbols alphabet \(\mathcal {A}_{\,\text {t}}\) is generated. This alphabet consists of \(N_\text {sym}\) symbols with lengths normally distributed around the mean value \(L_\text {sym}\). Each character is chosen from \(\mathcal {D}\) with uniform probability and repeated characters are allowed;

  2. a training SDB \(\mathcal {T}\) and a validation SDB \(\mathcal {V}\), respectively composed of \(N_\text {tr}\) and \(N_\text {val}\) sequences, are generated. Each of these sequences is built by concatenating \(N_\text {symseq}\) symbols chosen randomly from \(\mathcal {A}_{\,\text {t}}\). Notice that generally \(N_\text {symseq} > N_\text {sym}\), so there will be repeated symbols;

  3. in each sequence, every symbol is subject to noise with probability \(\mu \). The application of noise to a symbol in a sequence corresponds to the deletion, substitution or insertion of one character in that single instance of the symbol. This kind of noise is referred to as intra-pattern noise;

  4. a user-defined quantity of random characters is added between instances of symbols in each sequence. This noise is called inter-pattern noise. Such quantity depends on the parameter \(\eta \), which corresponds to the ratio between the number of characters belonging to actual symbols and the total number of characters of the sequence after the application of inter-pattern noise, that is,

    $$ \eta = \frac{(\#\ \text {symbol characters})}{(\#\ \text {total characters})}.$$

    Notice that the amount of inter-pattern noise is inversely proportional to the value of \(\eta \).
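The following Python sketch illustrates this generation process. The spread of the symbol lengths and the distribution of the inter-pattern gaps are assumptions of ours, since the text above fixes only the mean length \(L_\text {sym}\) and the ratio \(\eta \).

```python
import random
import string

def make_alphabet(n_sym, l_sym):
    """Step 1: true symbols with lengths normally distributed around
    l_sym (the standard deviation, here 1, is an assumption)."""
    return ["".join(random.choices(string.ascii_uppercase,
                                   k=max(1, round(random.gauss(l_sym, 1)))))
            for _ in range(n_sym)]

def mutate(sym):
    """Step 3: intra-pattern noise, one random edit on one instance."""
    i = random.randrange(len(sym))
    op = random.choice("dsi")
    if op == "d" and len(sym) > 1:
        return sym[:i] + sym[i + 1:]              # deletion
    c = random.choice(string.ascii_uppercase)
    if op == "s":
        return sym[:i] + c + sym[i + 1:]          # substitution
    return sym[:i] + c + sym[i:]                  # insertion

def make_sequence(alphabet, n_symseq, mu, eta):
    """Steps 2-4: concatenate symbols, mutate each with probability mu,
    then interleave random gap characters so that, on average,
    (# symbol chars) / (# total chars) equals eta."""
    parts = [mutate(s) if random.random() < mu else s
             for s in random.choices(alphabet, k=n_symseq)]
    sym_chars = sum(len(p) for p in parts)
    mean_gap = sym_chars * (1 - eta) / (eta * len(parts))
    seq = []
    for p in parts:
        gap = round(random.uniform(0, 2 * mean_gap))  # mean = mean_gap
        seq.append("".join(random.choices(string.ascii_uppercase, k=gap)))
        seq.append(p)
    return "".join(seq)
```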

6.1.2 Channel Noise

In this case we simulate noise affecting a hypothetical channel through which the sequence is transmitted. This kind of noise alters the objects in an uncorrelated manner, without respecting the separation between symbols.

We generate the original SDBs \(\mathcal {T}\) and \(\mathcal {V}\) in the same manner as described in steps 1 and 2 of Sect. 6.1.1. The noise is then added to these datasets by iterating through the objects of each sequence and altering each object with a fixed probability p. The alteration consists, with equal probability, of one of the following (see the sketch after this list):

  • the substitution of the object with another randomly chosen object;

  • the deletion of the object;

  • the addition of another randomly chosen object to the right of the current object position in the sequence.

The generated datasets \(\mathcal {T}\) and \(\mathcal {V}\) are then ready to be used as input to the FRL-GRADIS procedure. Notice that the true alphabet \(\mathcal {A}_{\,\text {t}}\) is unknown in real-world applications and is used here only to quantify the performance of the algorithm.
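A sketch of the channel noise application (names are ours):

```python
import random
import string

def channel_noise(sequence, p):
    """Alter each object with probability p by substitution, deletion,
    or insertion to its right, each chosen with equal probability."""
    out = []
    for obj in sequence:
        if random.random() >= p:
            out.append(obj)                # object transmitted intact
            continue
        op = random.choice(("sub", "del", "ins"))
        if op == "sub":
            out.append(random.choice(string.ascii_uppercase))
        elif op == "ins":
            out.append(obj)
            out.append(random.choice(string.ascii_uppercase))
        # "del": append nothing
    return "".join(out)

print(channel_noise("GTCAATGTCAGGT", p=0.1))
```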

6.2 Quality Measures

We now introduce the quality measures used in the following tests to evaluate the mining capabilities of the FRL-GRADIS algorithm. These measures are computed for the resulting alphabets obtained from both the extraction and the filtering steps presented in Sect. 5, in order to highlight the improvement made by the filtering procedure (i.e. the improvement of FRL-GRADIS over RL-GRADIS).

The redundancy R corresponds to the ratio between the cardinality of the mined alphabet \(\mathcal {A}\) and that of the true alphabet \(\mathcal {A}_{\,\text {t}}\), that is,

$$\begin{aligned} R = \frac{|\mathcal {A}|}{|\mathcal {A}_{\,\text {t}}|} \end{aligned}$$
(14)

Clearly, since the filtering step selects a subset \(\mathcal {A}_\text {f}\) (filtered alphabet) of the extracted alphabet \(\mathcal {A}_\text {e}\), we always have that

$$ R_\text {f} < R_\text {e}. $$

The redundancy measures the number of unnecessary symbols found by a frequent pattern mining procedure and ranges from zero to infinity. When \(R > 1\), some redundant symbols have been erroneously included in the alphabet, while when \(R < 1\) some symbols have been missed; the ideal value is \(R = 1\).

It is important to notice that the redundancy depends only on the number of reconstructed symbols, not on their similarity to the original alphabet. For this purpose we also introduce the mining error E, defined as the mean distance between each symbol \(\varOmega _i\) of the true alphabet \(\mathcal {A}_{\,\text {t}}\) and its best match within the alphabet \(\mathcal {A}\), where the best match is the symbol with the least distance from \(\varOmega _i\). In other words, considering \(\mathcal {A}_{\,\text {t}} = \{ \varOmega _1, \ldots , \varOmega _{|\mathcal {A}_{\,\text {t}}|} \}\) and \(\mathcal {A} = \{ {\tilde{\varOmega }}_1, \ldots , {\tilde{\varOmega }}_{|\mathcal {A}|} \}\), the mining error corresponds to

$$\begin{aligned} E = \frac{\sum _i d(\varOmega _i, {\tilde{\varOmega }}_{(i)})}{|\mathcal {A}_{\,\text {t}}|} \end{aligned}$$
(15)

where

$$ {{\tilde{\varOmega }}_{(i)}} = \mathop {{\mathrm {argmin}}}_{{\tilde{\varOmega }} \, \in \, \mathcal {A}} \;d(\varOmega _i, {\tilde{\varOmega }}). $$

This quantity plays the opposite role to the redundancy: it tracks the overall accuracy of the reconstruction of the true symbols regardless of the generated alphabet size. It assumes non-negative values and its ideal value is 0. For the same reasons stated above, the inequality

$$ E_\text {f} \ge E_\text {e} $$

holds, so the extraction procedure’s mining error constitutes a lower bound for the mining error obtainable with the filtering step.
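Both quality measures are straightforward to compute given a distance function d between symbols, e.g. the Levenshtein distance of the earlier sketch (function names are ours):

```python
def redundancy(alphabet, true_alphabet):
    """Eq. (14): ratio of mined to true alphabet sizes (ideal: 1)."""
    return len(alphabet) / len(true_alphabet)

def mining_error(alphabet, true_alphabet, dist):
    """Eq. (15): mean distance between each true symbol and its
    best match in the mined alphabet (ideal: 0)."""
    return sum(min(dist(t, s) for s in alphabet)
               for t in true_alphabet) / len(true_alphabet)
```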

6.3 Results

We executed the algorithm multiple times for different values of the noise parameters, to assess the different response of FRL-GRADIS to increasing amounts of noise. Most parameters have been held fixed for all the tests and they are listed in Table 1.

Table 1 Fixed parameters adopted for the tests

As a first result, we present the synthetic tests performed by adding varying quantities of symbols noise to the data sequences, with \(\mu = 0.5\) and variable amounts of inter-pattern noise \(\eta \). This means that about half of the symbols in a sequence are subject to the alteration of one character, while increasing amounts of random characters are added between symbols in each sequence. The results obtained with this configuration are shown in Figs. 4 and 5.

Fig. 4

Plot of the redundancy R of the extraction (RL-GRADIS) and filtering (FRL-GRADIS) steps with variable inter-pattern noise and \(\mu = 0.5\)

Fig. 5

Plot of the mining error E of the extraction (RL-GRADIS) and filtering (FRL-GRADIS) steps with variable inter-pattern noise and \(\mu = 0.5\)

The redundancy plot in Fig. 4 shows an apparently paradoxical trend for the extraction procedure: with decreasing amounts of inter-pattern noise (i.e. increasing values of \(\eta \)) the extraction algorithm performs more poorly, leading to higher redundancies. This can be easily explained by recalling how the clustering procedure works.

Fig. 6

Plot of the redundancy R of the extraction (RL-GRADIS) and filtering (FRL-GRADIS) steps with variable channel noise. Error bars represent the standard deviation over 3 runs of the algorithm with the same parameters

Fig. 7

Plot of the mining error E of the extraction (RL-GRADIS) and filtering (FRL-GRADIS) steps with variable channel noise. Error bars represent the standard deviation over 3 runs of the algorithm with the same parameters

Higher amounts of inter-pattern noise mean that the frequent symbols are more likely to be separated by random strings of characters. These strings of uncorrelated characters generate very sparse clusters of negligible cardinality, which are very likely to be deleted during the clustering's reinforcement step. Clusters corresponding to actual symbols, instead, are more active and compact, their boundaries being clearly delimited by the noise characters, and so they are more likely to survive the reinforcement step.

In the case of negligible (or non-existent) inter-pattern noise, instead, different symbols are more likely to occur in frequent successions, which cause the generation of many clusters corresponding to spurious symbols obtained from the concatenation of parts of different symbols. The filtering procedure overcomes this inconvenience: as can be seen in Fig. 4, it is barely affected by the amount of inter-pattern noise. Evidently, the filtering procedure becomes fundamental for higher values of the parameter \(\eta \), where the clustering produces highly redundant alphabets that would be infeasible to handle in a real-world application. Figure 5 shows that the mining error after the filtering procedure remains mostly the same for all values of \(\eta \), which means that the system is robust to moderate alterations of the input signal.

In the second pool of tests we show the response of the system to increasing quantities of channel noise p. Figures 6 and 7 show the redundancy and the mining error measured for different values of p. While FRL-GRADIS shows slightly higher mining error levels than RL-GRADIS, its redundancy is significantly lower and, to a large extent, insensitive to the quantity of noise in the system. Clearly, lower mining error levels can be obtained by choosing suitable values of the parameter \(\lambda \) (at the expense of the resulting redundancy) or stricter convergence criteria for the genetic algorithm (at the expense of convergence time).

In general, we can conclude that the system achieves a remarkable synthesis of the extracted alphabet at the cost of a modest additional mining error.

7 Conclusions

In this work we have presented a new approach to sequence data mining, focused on improving the interpretability of the frequent patterns found in the data. To this end, we employed a two-step procedure composed of a clustering algorithm, which extracts the frequent subsequences from a sequence database, and a genetic algorithm, which filters the returned set to retrieve a smaller set of patterns that best describes the input data. For this purpose we introduced the concept of coverage, which helps in recognizing the presence of a pattern within a sequence affected by noise. The experiments were performed on two classes of synthetic data affected by two different sources of noise. The results show a good overall performance and lay the foundations for improvements and further experiments on real data.