1 Introduction

A sequence is an ordered series of discrete items, where each item can be a bucket of elements. For example, \(\{\langle B \rangle \langle AAB \rangle \langle CC \rangle \langle A \rangle \}\). A commonly found special case is when each item has only one element, e.g., \(\{\langle \texttt {B} \rangle \langle \texttt {A} \rangle \langle \texttt {A} \rangle \langle \texttt {B} \rangle \langle \texttt {C} \rangle \langle \texttt {C} \rangle \langle \texttt {A} \rangle \}\), or simply \(\texttt {BAABCCA}\). In this paper, we work on this class of sequences (a.k.a. strings), which we define as a series of discrete symbols tied together in a certain order. A symbol can be an event or a value. Such sequences are found in processes where only one discrete event can occur at a time, such as clickstreams, music listening histories, weblogs, patient movements, and protein sequences.

Sequence data is omnipresent, which has led to the development of various sequence mining methods. Sequence mining research can be broadly divided into: (a) frequent pattern or subsequence mining (Aggarwal and Han 2014), (b) motif detection (Sandve and Drabløs 2006), (c) alignment (Li and Homer 2010), (d) data stream modeling (Silva et al. 2013), and (e) feature embedding (Kumar et al. 2012). Among them, feature embedding is particularly important because it provides a machine-interpretable representation of sequences. The embeddings can be used directly for (dis)similarity or “distance” computation between sequences, or as inputs to other machine learning models. A similar approach, word2vec, is popular in text mining for converting text into vector embeddings. This enables building sequence classification and clustering models, which have immense applications across the online industry, bioinformatics, and healthcare.

Feature embedding is, however, challenging because (a) sequences are arbitrary strings of arbitrary lengths, and (b) long-term dependencies (of sequence elements) are difficult to capture. A long-term dependency here means the effect of distant elements in a sequence on each other.

N-gram methods (where n-grams are also known as k-mers) are commonly used for feature representation. Several sequence kernels have been developed on top of n-gram features. Moreover, generative parametric models, such as nth-order Markov and hidden Markov models (HMMs), have been developed for sequences, in which the sequence features are represented by the transition and emission probability matrices.

However, in addition to other limitations (discussed in Sect. 1.1), most existing methods either limit themselves to extracting only short-term patterns or suffer from rapidly increasing computation when extracting long-term patterns.

Additionally, accurately comparing sequences of different lengths is a non-trivial problem, and traditional methods often lead to false positives. A false positive here implies incorrectly identifying two different sequences as similar. Consider these sequences: s1. \(\texttt {ABC}\), s2. \(\texttt {ABCABCABC}\), and s3. . Most traditional subsequence matching methods will render s1 similar to both s2 and s3. However, we call the similarity between s1 and s3 a false positive because s3’s overall pattern is significantly different from s1’s.

In this paper, we develop a new sequence feature embedding function, Sequence Graph Transform (SGT), that extracts short- and long-term sequence features without any increase in computation. This unique property removes the computation limitation; it enables us to tune the amount of short- to long-term patterns that is optimal for a given sequence problem. SGT also addresses the issue of false positives when comparing sequences of different lengths.

SGT embedding is a nonlinear transform of the inter-symbol distances in a sequence. Its name is attributed to the graphical interpretation of the embedding which shows the “association” between sequence symbols.

SGT produces an embedding in a finite-dimensional feature space that can be used as a vector in most mainstream data mining methods, such as k-means, kNN, SVM, and deep learning architectures. Moreover, it can be used as a graph for applying graph mining methods and for interpretable visualizations.

We show that these properties have led to significantly higher accuracy in sequence modeling with lower computation. We theoretically prove the SGT properties, and experimentally and practically validate its efficacy. We also show that SGT features can be used as an embedding layer in a feed-forward neural network (FNN). On our real-world data sets, this outperformed the current state-of-the-art long short-term memory (LSTM) classifiers in both runtime and accuracy.

1.1 Related work

Sequence mining is an extensively studied problem. Several works specifically address estimating sequence similarity and building feature representations for sequence classification, clustering, etc. We categorize the literature as follows.

Alignment Sequence alignment has two broad types: global alignment (Needleman and Wunsch 1970) and local alignment (Stoye et al. 1997). Global alignment measures the similarity between sequences over their entire length. While it works well for pairwise sequence comparison, it becomes prohibitively time intensive on large sequence data sets. For such data sets, multiple sequence alignment (MSA) techniques were developed.

Several MSA techniques accomplish global alignment (Notredame et al. 2000; Thompson et al. 1994b). But they are ineffective when sequences have common (homologous) patterns only over local regions. In such cases, local alignment needs to be performed (Bailey et al. 1994; Lawrence et al. 1993; Morgenstern 1999).

In most of these methods, the computation complexity remains an issue. Dynamic programming (DP) has been used in MSA techniques. However, DP suffers from high dimensionality in MSA because the number of sequences equals the number of dimensions. It is stated in Wang and Jiang (1994) that if two or more optimal paths are available and need to be traced backward, the complexity of the backtracking grows exponentially.

MSA is an NP-complete problem. To solve it, two types of approaches are prevalent: exact and progressive alignment. Exact algorithms usually deliver high-quality alignments that are very close to the optimal, but applying them to most real problems is unrealistic due to excessive complexity (Lipman et al. 1989; Stoye et al. 1997). Progressive alignment is used in CLUSTAL (Thompson et al. 1994a), BLAST (Altschul et al. 1997), FASTA (Pearson 1990), UCLUST (Edgar 2010), CD-HIT (Fu et al. 2012), and MUSCLE (Edgar 2004). These alignment algorithms are greedy in nature, which does not allow modification of string gaps, and hence the alignment similarity cannot be adjusted at a later stage. A greedy algorithm can also be trapped in local minima. Another major drawback is that most progressive alignments are sensitive to the initialization (the initial alignment). Moreover, most alignment algorithms are heuristics and face these challenges. They therefore suffer from accuracy and computation issues, due to which sequence alignment is still under research.

Kernels Sequence mining using string kernels has received considerable attention. In the literature, kernel functions have proven to be an effective approach (Leslie et al. 2004; Xing et al. 2010).

Over the last few decades, several string kernel methods have been proposed, e.g., Cristianini et al. (2000), Kuang et al. (2005), Leslie et al. (2001, 2004), Eskin et al. (2003) and Smola and Vishwanathan (2003). Among them, the k-spectrum kernel (Leslie et al. 2001), (k, m) mismatch kernel, and their variants (Eskin et al. 2003; Leslie et al. 2004) gained popularity in the early 2000s.

These kernels decompose the original strings into sub-structures, i.e., k-mers (short strings). They then count the k-mers with up to m mismatches in the original sequence to define a feature map. However, these methods capture only the patterns of short subsequences; they fail to capture long-term patterns. If larger k and m are taken to address this, the feature map and the computation grow exponentially. This makes them applicable to only small k and m, and eventually to small data sets (Wu et al. 2019).
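To make the spectrum decomposition concrete, below is a minimal sketch of an exact k-mer count feature map in Python; the (k, m) mismatch extension and the kernel computation itself are omitted, and the function name is illustrative.

```python
from collections import Counter
from itertools import product

def kmer_feature_map(seq, k, alphabet):
    """Exact k-spectrum feature map: counts of every length-k substring.
    Mismatch tolerance (the (k, m) mismatch kernel) is not implemented here."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # One dimension per possible k-mer; the space grows as |alphabet|**k,
    # which is why large k quickly becomes impractical.
    return [counts[''.join(p)] for p in product(alphabet, repeat=k)]

features = kmer_feature_map("BAABCCA", k=2, alphabet="ABC")
print(features)  # 9-dimensional vector of 2-mer counts
```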

A thread of recent research has made valid attempts to improve the computation of the kernel matrix, e.g., Farhan et al. (2017) and Kuksa et al. (2009). But these methods only address the scalability issue in terms of the length of strings and the size of the symbol set. The kernel matrix construction still has quadratic complexity with respect to the number of strings. Moreover, these methods inherit the issues of “local” kernels, i.e., long-term dependencies are ignored.

More recently, a string kernel with random features was introduced (Wu et al. 2019). This family of string kernels is defined through a series of different random feature maps. They discover global long-term patterns and maintain a computation cost linear with respect to the string length and the number of string samples. This kernel produces random string embeddings (RSE) by utilizing random feature approximations from randomly generated strings. These random strings have a short length to reduce the computation complexity from quadratic to linear. But the approximation for computational gain has a counter effect on the efficacy of the kernel embeddings.

Another class of kernel methods computes pairwise sequence similarity using some global or local alignment measure, e.g., Needleman and Wunsch (1970) and Smith and Waterman (1981). These string alignment kernels are defined using a learning methodology R-convolution (Haussler 1999)—a framework for computing the kernels between discrete objects. It works by recursively decomposing structured objects into sub-structures and computes their global and local alignments to derive a feature map. While these methods cover both short-term (local) and long-term (global) dependencies, they have high computation costs: quadratic in both the number and the length of sequences.

Time-series classification Sequences are a special type of time series. A typical time series is a sequence of observations of a continuous variable; a sequence, as defined here, is the analogous object for a discrete or categorical variable. It has also been stated in Gamboa (2017) that any classification problem in which the data is registered taking into account some notion of order can be cast as a time series classification (TSC) problem.

TSC has been deeply studied. With the increase in temporal data, several TSC algorithms have been proposed in the past decade, e.g., Bagnall et al. (2015). One of the most popular and traditional TSC approaches is the use of the nearest neighbor (NN) classifier in conjunction with a distance function (Lines and Bagnall 2015). Bagnall et al. (2015) showed that dynamic time warping (DTW) distance with an NN classifier worked effectively.

Prior research in Lines and Bagnall (2015) also showed that the DTW distance measure worked better than other distance measures. Some recent contributions include ensembling methods, e.g., Bagnall et al. (2015), Hills et al. (2014), Bostrom and Bagnall (2017), Lines et al. (2016), Schäfer (2015), Kate (2016), Deng et al. (2013) and Baydogan et al. (2013). Regardless of the approach, they rely on an effective distance measure, and most distance measures face issues in effectively capturing both short- and long-term dependencies with tractable computation.

Deep learning Deep learning-based methods have touched on a variety of problems including sequence problems. Natural language processing and speech recognition problems solved with deep learning architectures have a similar construct as a sequence problem. Due to this, the methods developed for the former have been adopted in sequence problems, e.g., in Lines et al. (2016, 2018), Bagnall et al. (2017) and Neamtu et al. (2018).

More specifically, LSTMs in RNNs are extensively used for sequence mining problems due to their ability to learn long- and short-term sequence patterns (Graves 2013). LSTMs are commonly used for building supervised sequence models and sequence-to-sequence predictions (Sutskever et al. 2014). However, LSTMs and other RNNs cannot differentiate between length-sensitive and length-insensitive sequence problems (discussed in Sect. 1.2). The LSTM layer is also not interpretable for visualization. Additionally, an LSTM is computationally intensive compared to an FNN model used with an SGT embedding (shown in Sect. 5.1).

Pattern discovery N-gram (also known as k-mer) methods and their variants (Comin and Verzotto 2012; Didier et al. 2012) are popular approaches for pattern discovery. However, their feature space and computation increase exponentially when capturing long-term dependencies. Another class of methods performs frequent subsequence discovery using apriori-like breadth-first search or pattern-growth depth-first search, e.g., GSP (Srikant and Agrawal 1996), PSP (Masseglia et al. 1998), and SPADE (Zaki 2001). These methods, however, had critical nontrivial computation costs, which were addressed by PrefixSpan (Han et al. 2001; Chiu et al. 2004) and SPAM (Ayres et al. 2002). While some of these methods are more suitable for sequences of item sets, most of their feature representations can lead to poor accuracy.

Wang et al. (2005) extracted features from protein sequences using a 2-gram encoding method and 6-letter exchange group methods to find global similarity, and used them with a neural network model. Some user-defined variables like len, mut, and occur were also used to find local similarities. Wu et al. (2006) enlarged the 2-gram encoding to an n-gram to improve the accuracy. Zainuddin and Kumar (2008) developed a radial-based approach to reduce the computational overhead of the n-gram encoding method. Zaki et al. (2004) used a hidden Markov model to extract features, which were then used to build SVM classifiers that can train on the data in a high-dimensional space.

Hash maps Hash maps address the high-dimensional input spaces of fixed- or variable-length n-gram spaces by performing dimensionality reduction. They have typically been developed in bioinformatics (Buhler 2001; Buhler and Tompa 2002; Indyk and Motwani 1998; Wesselink et al. 2002). Shi et al. (2009) used hashing to compare all subgraph pairs on biological graphs. However, feature hashing can result in a significant loss of information, especially when hash collisions occur between highly frequent features with significantly different class distributions. Along the lines of hashing methods, genome fingerprinting methods have been developed (Glusman et al. 2017). They are fast and accurate. However, they are particularly built for, and suited to, genome data due to the small symbol set size and the known domain knowledge.

Generative Parametric generative methods typically make Markovian distribution assumptions, more specifically a first-order Markov property, e.g., Cadez et al. (2003) and Ranjan et al. (2015). However, such a distributional assumption is not always valid. A general nth-order Markov model has also been proposed but is not popular in practice due to high computation. Hidden Markov model-based approaches are popular in both bioinformatics and general sequence problems (Helske and Helske 2017; Remmert et al. 2012). An HMM assumes a hidden layer of latent states that generates the observed sequence. The hidden states have a first-order Markov transition assumption, but due to the multi-layer setting, the first-order assumption is not transmitted to the observed sequence. However, tuning an HMM (finding the optimal hidden states) is difficult and computationally intensive.

Graph based Temporal graphs are a category of graph representations similar to the SGT defined in this paper. Temporal graphs were used for phenotyping in Liu et al. (2015) and temporal skeletonization in Liu et al. (2016). However, the definition of the developed SGT is fundamentally different from these temporal graphs. Moreover, SGT’s ability to capture short- and long-term features is theoretically substantiated. Another class of graph methods hypothesizes that sequences are generated from some evolutionary process where a sequence is produced by reproducing complex strings from simpler substrings, as in Siyari et al. (2016) and references therein. However, the estimation algorithms for these methods are heuristics, sometimes greedy, and have identifiability issues. Moreover, the evolutionary assumption may not always hold.

1.2 Research specification

1.2.1 Problem

The related methods discussed above fail to address at least one of the following challenges: (a) capturing long-term dependencies, (b) false positives upon comparing sequences of different lengths, and (c) domain specificity and/or computational complexity with respect to the sequence length, sample size, and size of the symbol set, where the sequence length is the total number of symbols in the sequence, the sample size is the number of sequences in the data set, and the symbol set is the set of unique symbols that make up the sequences of the data set.

We propose a new sequence feature extraction function, Sequence Graph Transform (SGT), that addresses the above challenges and is shown to outperform existing state-of-the-art methods in sequence data mining. SGT works by quantifying the pattern in a sequence by scanning the positions of all symbols relative to each other. We call it a graph transform because of its inherent property of interpretation as a graph, where the symbols form the nodes and a directed connection between two nodes shows their “association.” These “associations” between all symbols represent the signature features of a sequence.

Sequence analysis problems can be broadly divided into (a) length-sensitive: the inherent patterns, as well as the sequence lengths, should match to render two sequences as similar, e.g., in protein sequence clustering, and (b) length-insensitive: the inherent patterns should be similar, irrespective of the lengths, e.g., weblog comparisons. In contrast with the existing literature, SGT provides a solution for both scenarios. The advantage of this property becomes more pronounced when we have to perform both types of analysis on the same data and implementing different methods for each becomes cumbersome.

1.2.2 Contribution

In this paper, our major contribution is the development of a new sequence feature embedding function: Sequence Graph Transform. SGT embedding exhibits the following properties:

1. Short- and long-term Captures both short- and long-term dependencies, i.e., both local and global patterns. The amount of long-term dependency to incorporate can be controlled with a tuning parameter. Importantly, unlike the existing methods, enlarging the long-term dependency does not affect SGT computation. By removing this computation limitation, SGT embedding can be tuned effectively based on the problem requirement.

2. Computationally tractable SGT computation complexity is linear with respect to the number of sequences. Moreover, two algorithms are proposed for SGT estimation; one is selected based on whether the sequence length or the symbol set size is larger.

3. Compatibility Compatible with mainstream supervised and unsupervised learning methods. SGT is an embedding function that converts an unstructured sequence into a finite-dimensional vector. Mainstream learning methods, such as an SVM classifier or k-means clustering, take such vectors as inputs. SGT embedding used in conjunction with mainstream learning methods yields significantly superior results for sequence problems.

4. Interpretability Embedded features are interpretable. Each feature in an SGT embedding corresponds to a directional dependency between a symbol pair. For example, the SGT embedding of a sequence \(\texttt {BAABCCA}\) will have a feature corresponding to each 2-permutation of symbols: \(\texttt {(A,A);\, (A,B);\,}\) \(\texttt {(A,C);\, (B,A);\, (B,B);\, (B,C);\, (C,A);\, (C,B);\, (C,C)}\). The symbol order in a feature \(\texttt {(}i,j)\) indicates the forward dependency from \(\texttt {i}\) to \(\texttt {j}\): a high value of the feature indicates a high forward dependency, i.e., \(\texttt {i}\) is followed by a significant number of \(\texttt {j}\)’s in the sequence.
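As a small illustration of the interpretability property above, the snippet below simply enumerates the \(|{\mathcal {V}}|^{2}\) named features of an SGT embedding over the symbol set \(\{\texttt {A,B,C}\}\); the code is illustrative, and each (i, j) feature holds the directional association of i followed by j.

```python
from itertools import product

symbols = ["A", "B", "C"]
# One SGT feature per ordered symbol pair, i.e., |V|^2 features:
# (A,A), (A,B), (A,C), (B,A), (B,B), (B,C), (C,A), (C,B), (C,C)
feature_names = list(product(symbols, repeat=2))
print(feature_names, len(feature_names))  # ... 9
```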

It is important to note that SGT is an embedding methodology. It is used in conjunction with mainstream supervised and unsupervised learning methods. The source code, data sets, and illustrative examples are provided at https://github.com/cran2367/sgt (see “Appendix E”).

1.2.3 Limitations

SGT works in most sequence problems but has the following limitations.

1. The effectiveness of SGT embedding diminishes if the sequence symbol set is small, for example, in binary or DNA sequences where \(symbols \in \{0,1\}\) and \(\{\texttt {A,C,G,T}\}\), respectively. This is because the size of the embedding is quadratic in the size of the symbol set; in the above examples, the SGT embedding will be of size 4 and 16, respectively. If the sequence length is large, such a small embedding cannot hold sufficient information to characterize the sequence.

2. The proposed SGT algorithm applies to sequences of single-element items, called symbols. The reason is that SGT works by extracting dependencies between the items in a sequence, for which SGT assumes an item to be unique. In general, however, an item is a bucket of elements, and items can have common elements; e.g., the items in \(\{\langle B \rangle \langle AAB \rangle \langle CC \rangle \langle A \rangle \}\) share elements. The proposed SGT does not draw information from the presence of common elements in items. An extension to multi-element item sequences is non-trivial and should be pursued in future research.

In the following, we develop SGT and provide its theoretical support. We perform an extensive experimental evaluation and show that SGT bridges the gap between sequence mining and mainstream data mining through direct application of fundamental methods, viz. PCA, k-means, SVM, and graph visualization via SGT on sequence data analysis.

2 Sequence graph transform (SGT)

2.1 Overview and intuition

Fig. 1: Illustration of the “effect” of elements on each other

By definition, a sequence can be either feed-forward or bidirectional. In a feed-forward sequence, events (symbol instances) occur in succession; e.g., in a clickstream sequence, the click events occur one after another in a forward direction. On the other hand, in a bidirectional sequence, the directional or chronological order of symbol instances is not present or not important. In this paper, we present SGT for feed-forward sequences; SGT for bidirectional sequences is given in Sect. 7.1.

For either of these sequence types, the developed SGT works on a fundamental premise—the relative positions of symbols in a sequence characterize the sequence—to extract the pattern features of the sequence. This premise holds for most sequence mining problems because the similarity in sequences is often measured based on the similarities in their pattern from the symbol positions.

Figure 1 shows an illustrative example of a feed-forward sequence. In this example, the presence of symbol \(\texttt {B}\) at positions 5 and 8 should be seen in context with, or as a result of, all other predecessors. To extract the sequence features, we take the relative positions of one symbol pair at a time. For example, the relative positions for pair (\(\texttt {A}\),\(\texttt {B}\)) are {(2,3),5} and {(2,3,6),8}, where the values in the position set for \(\texttt {A}\) are the ones preceding \(\texttt {B}\). In the SGT procedure defined and developed in the following Sects. 2.3 and 2.4, the sequence features are shown to be extracted from this positional information.

Fig. 2: SGT overview

These extracted features are an “association” between \(\texttt {A}\) and \(\texttt {B}\), which can be interpreted as a connection feature representing “\(\texttt {A}\) leading to \(\texttt {B}\).” We should note that “\(\texttt {A}\) leading to \(\texttt {B}\)” will be different from “\(\texttt {B}\) leading to \(\texttt {A}\).” The associations between all symbols in the symbol set denoted as \({\mathcal {V}}\) can be extracted similarly to obtain sequence features in a \(|{\mathcal {V}}|^{2}\)-dimensional space.

This is similar to the Markov probabilistic models, in which the transition probability of going from \(\texttt {A}\) to \(\texttt {B}\) is estimated. However, SGT is different because the connection feature (1) is not a probability, and (2) takes into account all orders of the relationship without any increase in computation.

Besides, SGT also makes it easy to visualize the sequence as a directed graph, with the sequence symbols in \({\mathcal {V}}\) as graph nodes and the edge weights equal to the directional association between nodes. Hence, we call it a sequence graph transform. Moreover, we show in Sect. 7.2 that under certain conditions, SGT also allows node (symbol) clustering.

A high-level overview of our approach is given in Fig. 2a, b. In Fig. 2a, we show that applying SGT on a sequence, s, yields a finite-dimensional SGT feature vector \(\varPsi ^{(s)}\) for the sequence, also interpreted and visualized as a directed graph. For a general sequence data analysis, SGT can be applied to each sequence in the sample (Fig. 2b). The resulting feature vectors can be used with mainstream data mining methods.

2.2 Notations

Suppose we have a data set of sequences denoted by \({\mathcal {S}}\). Any sequence in the data set, denoted by s(\(\in {\mathcal {S}}\)), is made of symbols in a set \({\mathcal {V}}\). A sequence can have instances of one or many symbols from \({\mathcal {V}}\). For example, sequences from a data set \({\mathcal {S}}\), made of symbols in, say, \({\mathcal {V}}=\{\texttt {A,B,C,D,E}\}\), can be \({\mathcal {S}}=\{\texttt {AABAAABCC,DEEDE}\), \(\texttt {ABBDECCABB,}\ldots \}\). The length of a sequence s, denoted by \(L^{(s)}\), is equal to the number of events (in this paper, the term “event” is used for a symbol instance) in it. In the sequence, \(s_{l}\) will denote the symbol at position l, where \(l=1,\ldots ,L^{(s)}\) and \(s_{l}\in {\mathcal {V}}\).

We extract a sequence s’s features in the form of “associations” between the symbols, represented as \(\psi _{uv}^{(s)}\), where \(u,v\in {\mathcal {V}}\), are the corresponding symbols, and \(\psi \) is a function of a helper function \(\phi \). \(\phi _{\kappa }(d)\) is a function that takes a “distance,” d, as input, and \(\kappa \) as a tuning hyper-parameter.

2.3 SGT definition

As also explained in Sect. 2.1, SGT extracts features from the relative positions of events. A quantification of the “effect” from the relative positions of two events in a sequence is given by \(\phi (d(l,m))\), where l, m are the positions of the events, and d(l, m) is a distance measure. This quantification is the effect of the preceding event on the later event. For example, see Fig. 3a, where u and v are at positions l and m, and the directed arc denotes the effect of u on v.

For developing SGT, we require the following conditions on \(\phi \): (a) strictly greater than 0: \(\phi _{\kappa }(d)>0;\,\forall \,\kappa>0,\,d>0\); (b) strictly decreasing with d: \(\frac{\partial }{\partial d}\phi _{\kappa }(d)<0\); and (c) strictly decreasing with \(\kappa \): \(\frac{\partial }{\partial \kappa }\phi _{\kappa }(d)<0\).

The first condition is to keep the extracted SGT feature, \(\psi =f(\phi )\), easy to analyze, and interpret. The second condition strengthens the effect of closer neighbors. The last condition helps in tuning the procedure, allowing us to change the effect of neighbors.

There are several functions that satisfy the above conditions: e.g., Gaussian, Inverse and Exponential. We take \(\phi \) as an exponential function because it will yield interpretable results for the SGT properties (Sect. 2.4.1) and \(d(l,m)=|m-l|\).

$$\begin{aligned} \phi _{\kappa }(d(l,m))=e^{-\kappa |m-l|},\,\forall \,\kappa>0,\,d>0 \end{aligned}$$
(1)
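As a quick numeric check (a sketch, not part of the SGT definition), the snippet below evaluates the exponential \(\phi \) in Eq. (1) for a few gaps d and two values of \(\kappa \); the effect decays with the gap, and a larger \(\kappa \) suppresses distant pairs more strongly.

```python
import math

def phi(d, kappa):
    """Eq. (1): effect of a preceding event on a later one separated by gap d."""
    return math.exp(-kappa * d)

for kappa in (1, 5):
    print(kappa, [round(phi(d, kappa), 4) for d in (1, 2, 5, 10)])
# 1 [0.3679, 0.1353, 0.0067, 0.0]
# 5 [0.0067, 0.0, 0.0, 0.0]
```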
Fig. 3: Illustration of the effect of symbols’ relative positions

In a general sequence, we will have several instances of a symbol pair. For example, see Fig. 3b, where there are five (uv) pairs, and an arc for each pair shows an effect of u on v. Therefore, the first step is to find the number of instances of each symbol pair. The instances of symbol pairs are stored in a \(|{\mathcal {V}}|\times |{\mathcal {V}}|\) asymmetric matrix, \(\varLambda \). Here, \(\varLambda _{uv}\) will have all instances of symbol pairs (uv), such that in each pair instance, v’s position is after u.

$$\begin{aligned} \varLambda _{uv}(s)= & {} \{(l,m):\,s_{l}=u,s_{m}=v,\nonumber \\&l<m,l,m\in 1,\ldots ,L^{(s)}\} \end{aligned}$$
(2)

After computing \(\phi \) from each (uv) pair instance for the sequence, we define the “association” feature \(\psi _{uv}\) as a normalized aggregation of all instances, as shown below in Eqs. (3a) and (3b). Here, \(|\varLambda _{uv}|\) is the size of the set \(\varLambda _{uv}\), which is equal to the number of (uv) pair instances. Eq. (3a) gives the feature expression for a length-sensitive sequence analysis problem because it also contains the sequence length information within it (proved with a closed-form expression under certain conditions in Sect. 2.4.1). In Eq. (3b), the length effect is removed by normalizing \(|\varLambda _{uv}|\) with the sequence length \(L^{(s)}\) for length-insensitive problems (shown in Sect. 2.4.1).

$$\begin{aligned} \psi _{uv}(s)= & {} \frac{\sum _{\forall (l,m)\in \varLambda _{uv}(s)}\phi _{\kappa }(d(l,m))}{|\varLambda _{uv}(s)|},\quad \text {length sensitive} \end{aligned}$$
(3a)
$$\begin{aligned} \psi _{uv}(s)= & {} \frac{\sum _{\forall (l,m)\in \varLambda _{uv}(s)}\phi _{\kappa }(d(l,m))}{|\varLambda _{uv}(s)|/L^{(s)}},\quad \text {length insensitive} \end{aligned}$$
(3b)

and \(\varPsi (s)=[\psi _{uv}(s)],\,u,v\in {\mathcal {V}}\) is the SGT feature representation of sequence s.

For illustration, the SGT feature for symbol pair \(\texttt {(A,B)}\) in sequence in Fig. 1 can be computed as (for \(\kappa =1\) in length-sensitive SGT): \(\varLambda _{\texttt {AB}}=\{(2,5);(3,5);\,(2,8);\,(3,8);\,(6,8)\}\) and \(\psi _{\texttt {AB}}=\frac{\sum _{\forall (l,m)\in \varLambda _{\texttt {AB}}}e^{-|m-l|}}{|\varLambda _{\texttt {AB}}|}= \frac{e^{-|5-2|}+e^{-|5-3|}+e^{-|8-2|}+e^{-|8-3|}+e^{-|8-6|}}{5}=0.066\).
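This worked example can be checked in a few lines of Python, using the positions of \(\texttt {A}\) (2, 3, 6) and \(\texttt {B}\) (5, 8) from the example above; only the standard library is used.

```python
import math
from itertools import product

pos_A, pos_B = [2, 3, 6], [5, 8]   # positions of A and B in the Fig. 1 sequence
kappa = 1

# Eq. (2): all (l, m) pair instances with A occurring before B
Lambda_AB = sorted((l, m) for l, m in product(pos_A, pos_B) if l < m)
# Eq. (3a), length-sensitive: normalized aggregation of phi over the pair instances
psi_AB = sum(math.exp(-kappa * (m - l)) for l, m in Lambda_AB) / len(Lambda_AB)
print(Lambda_AB)          # [(2, 5), (2, 8), (3, 5), (3, 8), (6, 8)]
print(round(psi_AB, 3))   # 0.066
```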

The features, \(\varPsi ^{(s)}\), can be either interpreted as a directed “graph,” with edge weights, \(\mathcal {\psi }\), and nodes in \({\mathcal {V}}\) or vectorized to a \(|{\mathcal {V}}|^{2}\)-vector denoting the sequence s in the feature space.

2.4 SGT properties

2.4.1 Short- and long-term features

In this section, we show SGT’s property of capturing both short- and long-term sequence pattern features. This is shown by a closed-form expression for the expectation and variance of the SGT feature, \(\psi _{uv}\), under some assumptions.

Assume a sequence of length L with an inherent pattern: u,v occur closely together within a stochastic gap as \(X\sim N(\mu _{\alpha },\sigma _{\alpha }^{2})\), and the intermittent stochastic gap between the pairs as \(Y\sim N(\mu _{\beta },\sigma _{\beta }^{2})\), such that, \(\mu _{\alpha }<\mu _{\beta }\) (see Fig. 4). X and Y characterize the short- and long-term patterns, respectively. Note that this assumption is only for showing an interpretable expression and is not required in practice.

Fig. 4: Representation of short- and long-term dependencies

Theorem 1

The expectation and variance of the SGT feature, \(\psi _{uv}\), have closed-form expressions under the above assumption, which show that it captures both short- and long-term patterns present in a sequence, in both the length-sensitive and length-insensitive SGT variants.

$$\begin{aligned} E[\psi _{uv}]&= {\left\{ \begin{array}{ll} \frac{2}{pL+1}\gamma ; &{} \text {length sensitive}\\ \frac{2L}{pL+1}\gamma ;&{} \text {length insensitive} \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} \text {var}(\psi _{uv})&= {\left\{ \begin{array}{ll} \left( \frac{1}{pL(pL+1)/2}\right) ^{2}\pi ; &{} \text {length sensitive}\\ \left( \frac{1}{p(pL+1)/2}\right) ^{2}\pi ; &{} \text {length insensitive} \end{array}\right. } \end{aligned}$$
(5)

where,

$$\begin{aligned} \gamma&= \frac{e^{-{\tilde{\mu }}_{\alpha }}}{\left| \left( 1-e^{-{\tilde{\mu }}_{\beta }}\right) \left[ 1-\frac{1-e^{-pL{\tilde{\mu }}_{\beta }}}{pL(e^{{\tilde{\mu }}_{\beta }}-1)}\right] \right| } \end{aligned}$$
(6)
$$\begin{aligned} \pi&= \frac{e^{-2{\tilde{\mu }}{}_{\alpha }}}{1-e^{-2{\tilde{\mu }}{}_{\beta }}}\left( pL-e^{-2{\tilde{\mu }}_{\beta }}\left( \frac{1-e^{-2pL{\tilde{\mu }}{}_{\beta }}}{1-e^{-2{\tilde{\mu }}{}_{\beta }}}\right) \right) \end{aligned}$$
(7)

and, \({\tilde{\mu }}_{\alpha }=\kappa \mu _{\alpha }-\frac{\kappa ^{2}}{2}\sigma _{\alpha }^{2};{\tilde{\mu }}_{\beta }=\kappa \mu _{\beta }-\frac{\kappa ^{2}}{2}\sigma _{\beta }^{2}\), \(p\rightarrow \text {constant}\).

Proof

Given in “Appendix A”.

As we can see in Eq. (4), the expected value of the SGT feature is proportional to the term \(\gamma \). The numerator of \(\gamma \) contains information about the short-term pattern, and its denominator has long-term pattern information.

In Eq. (6), we can observe that if either of \(\mu _{\alpha }\) (the closeness of u and v in the short-term) and/or \(\mu _{\beta }\) (the closeness of u and v in the long-term) decreases, \(\gamma \) will increase, and vice versa. This emphasizes two properties: (a) the SGT feature, \(\psi _{uv}\), is affected by changes in both short- and long-term patterns, and (b) \(\psi _{uv}\) increases when uv becomes closer in the short- or long- range in the sequence, providing an analytical connection between the observed pattern and the extracted feature. Besides, it also proves the graph interpretation of SGT: \(\psi _{uv}\) that denotes the edge weight for nodes u and v (in the SGT-graph) increases if closeness between uv increases in the sequence, meaning that the nodes become closer in the graph space (and vice versa). Importantly, \(\lim _{L\rightarrow \infty }\text {var}(\psi _{uv})\rightarrow 0\) ensures feature stability.

In the length-insensitive SGT feature expectation in Eq. (4), it is straightforward to show that it becomes independent of the sequence length as the length increases. As sequence length, L, increases, the (uv) SGT feature approaches a constant, given as \(\lim _{L\rightarrow \infty }E[\psi _{uv}] \rightarrow \frac{2}{p}\left| \frac{e^{-{\tilde{\mu }}_{\alpha }}}{1-e^{-{\tilde{\mu }}_{\beta }}}\right| \).

Besides, for this \(\lim _{L\rightarrow \infty }\text {var}(\psi _{uv})\underset{1/L}{\rightarrow }0\). Thus, the expected value of the SGT feature becomes independent of the sequence length at a rate inverse to the length. In our experiments, we observe that the SGT feature approaches a length-invariant constant when \(L>30\).

$$\begin{aligned} \lim _{L\rightarrow \infty }\Pr \left\{ \psi _{uv}=\frac{2}{p}\left| \frac{e^{-{\tilde{\mu }}_{\alpha }}}{1-e^{-{\tilde{\mu }}_{\beta }}}\right| \right\}&\underset{1/L}{\rightarrow }&1 \end{aligned}$$
(8)

Furthermore, the length-sensitive SGT feature expectation in Eq. (4) contains the sequence length, L. This shows that the SGT feature has the information of the sequence pattern, as well as the sequence length. This enables an effective length-sensitive sequence analysis because sequence comparisons via SGT will require both patterns and sequence lengths to be similar.

Additionally, for either case, if the pattern variances, \(\sigma _{\alpha }^{2}\) and \(\sigma _{\beta }^{2}\), in the above scenario are small, \(\kappa \) allows regulating the feature extraction: higher \(\kappa \) reduces the effect from long-term patterns and vice versa. \(\square \)

2.4.2 Uniqueness of SGT sequence encoding

The properties discussed above play an important role in SGT’s effectiveness. Due to these properties, unlike the methods discussed in Sect. 1.1, SGT can capture higher orders of relationships without any increase in computation. Besides, SGT can effectively find sequence features without the need for any hidden string/state(s) search.

In this section, we show an additional property of SGT useful for sequence encoding while answering the question: is the SGT feature for a sequence unique? Yes and no. Based on Theorem 2 given below, a stack of SGTs computed for sufficiently different values of \(\kappa \) will be a unique representation of a sequence. This representation can also be used for sequence encoding.

However, in typical sequence mining problems, we require sequences with similar (same) patterns to be close (equal) in the feature space. This makes data separation in clustering and boundary computation in classification easier. Therefore, stacking SGTs is usually not required, as also found in our results.

Theorem 2

A stack of SGTs for a sequence s, \(\varPsi ^{(\kappa )}(s), \kappa = 1,2,\ldots \) uniquely characterizes the sequence.

Proof

The theorem can be proved if we prove that a sequence s can be reconstructed from SGT components \({\mathbf {W}}^{(\kappa )}, \kappa = 0, 1,\ldots , K\) with probability 1 as \(K \rightarrow \infty \), given its length L.

For reconstruction, we have to find the elements present at each position, \(x_l, l=1,\ldots , L\).

\({\mathbf {W}}^{(0)}\) gives the initialization of the number of occurrences of each paired instance of elements \(u,v \in {\mathcal {V}}\).

We solve the following system of equations, where the unknowns are \(x_l, l=1,\ldots , L\), using the known \({\mathbf {W}}^{(\kappa )}, \kappa = 1, 2, \ldots \)

$$\begin{aligned} \sum e^{-\kappa |x_l - x_m|}= & {} W^{(\kappa )}_{uv}, u,v\in {\mathcal {V}} \end{aligned}$$
(9)

Solving this system of equations will yield multiple solutions for \(x_l, l=1,2,\ldots ,L\). Suppose the set of solutions after solving the system of Eq. (9) for \(\kappa =1,\ldots ,K\) is \(\alpha \).

Since \({\mathbf {W}}^{(\kappa )}, \kappa = 1, 2, \ldots \) are independent (see proof in “Appendix B”), adding another system of equations for \(\kappa = K+1\) will result in a reduced set of solutions \(\alpha '\), i.e. \(|\alpha '| < |\alpha |\).

Therefore, by induction as \(K \rightarrow \infty \), \(|\alpha | \rightarrow 1\), i.e. we reach a unique solution which is the reconstructed sequence. \(\square \)

SGTs can be stacked if the objective is to ensure sequences in a data set do not map to the same representation. However, in most sequence mining problems the objective is to identify sequences that have similar inherent patterns. To that end, only one SGT that appropriately captures the long- and short- term patterns is usually sufficient.

3 SGT algorithm

Algorithm 1: SGT embedding computed by traversing the sequence (pseudocode)
Algorithm 2: SGT embedding computed by traversing the symbol set (pseudocode)

We have devised two algorithms for SGT. The first algorithm (see Algorithm 1) is faster for short sequences when the sequence lengths on average are significantly smaller than the size of the symbol set, i.e., \(L<<|{\mathcal {V}}|\). The second (see Algorithm 2) is faster otherwise.

The input to the algorithms is a sequence s, a symbol set \({\mathcal {V}}\) that makes up the sequence, and a tuning parameter \(\kappa \). The symbol set \({\mathcal {V}}\) can be larger than the set of symbols present in the sequence s, i.e., \({\mathcal {V}} \supseteq \{s_i\},\, s_i \in s,\, s_i \ne s_j\, \forall i,j\). If the sequence s belongs to a population of sequences \({\mathcal {S}}\), i.e., \(s \in {\mathcal {S}}\), then the symbol set should comprise the symbols that construct all the sequences in \({\mathcal {S}}\), i.e., \({\mathcal {V}} \leftarrow \cup \{s_i\},\, s_i \in s,\, s_i \ne s_j \, \forall i,j,\, s\in {\mathcal {S}}\).

The algorithms are initialized with two zero square matrices \({\mathbf {W}}^{(0)},{\mathbf {W}}^{(\kappa )}\) of size \(|{\mathcal {V}}|\). Their rows and columns refer to the symbols in \({\mathcal {V}}\), and a cell will denote a value for the corresponding symbols (uv). These matrices will be iteratively updated during the SGT computation. They are expected to be sparse if the symbol set is large. Therefore, a sparse representation of \({\mathbf {W}}^{(0)},{\mathbf {W}}^{(\kappa )}\) can also be used for computational gains. Besides, a length variable L is initialized to 1 and 0 in Algorithms 1 and  2, respectively. L is also updated during the learning iterations and used if the SGT embedding is for a length-insensitive problem (if length-insensitive is True).

\({\mathbf {W}}^{(\kappa )}\) and \({\mathbf {W}}^{(0)}\) denote the numerator and denominator in Eqs. (3a) and (3b), respectively. These terms are computed differently in the two algorithms. In Algorithm 1, the sequence s is traversed element by element in a double nested (ij) iterations (lines 3–4). Inside an (ij) iteration the corresponding symbols \((s_i, s_j)\) are taken from the sequence s. For this \((s_i, s_j)\) the cells \({\mathbf {W}}^{(0)}_{s_i,s_j}\) and \({\mathbf {W}}^{(\kappa )}_{s_i,s_j}\) are incremented. These are one-step increments for every instance of \((u,v) \forall u,v \in {\mathcal {V}}\) in s.

Instead of traversing the sequence, Algorithm 2 traverses the symbol set \({\mathcal {V}}\) in double nested (u, v) iterations (lines 9–11). This is computationally more efficient for long sequences that have a relatively smaller symbol set.

To facilitate this algorithm, a helper function GetSymbolPositions is defined. It returns a \(\{u:\, \{position\}\}\) dictionary where \(u\in {\mathcal {V}}\) and \(\{position\}\) is the list of indexes at which u is present in s.

Inside a (uv) iteration the positions of the symbols u and v are known. In line 13, the cross product of the positions is taken such that the position of v is after u. The constraint is for a feed-forward sequence and can be omitted for a bidirectional sequence.

The cells \({\mathbf {W}}^{(0)}_{u,v}\) and \({\mathbf {W}}^{(\kappa )}_{u,v}\) are then updated with the net value of the (ij) instances in the cross product. Unlike the update steps in Algorithm 1, the increments here are accumulative.

The sequence length L is computed in the outer iteration loop in both algorithms. If the embedding is length-insensitive, \({\mathbf {W}}^{(0)}\) is scaled by the sequence length. Finally, the SGT embedding is output as the \(\kappa \)-th root of the element-wise division of \({\mathbf {W}}^{(\kappa )}\) by \({\mathbf {W}}^{(0)}\), i.e., \(\left( \frac{W^{(\kappa )}_{u,v}}{W^{(0)}_{u,v}}\right) ^{\frac{1}{\kappa }}\).

The resulting embedding is a \(|{\mathcal {V}}|\times |{\mathcal {V}}|\) matrix. The matrix can be used as is for feature interpretation, visualization, or learning similar to the way an adjacency matrix is used. The embedding is, otherwise, vectorized to a \(|{\mathcal {V}}|*|{\mathcal {V}}|\) vector and used as input to an unsupervised or supervised learning method.
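Putting the steps above together, here is a minimal Python sketch in the style of Algorithm 2 (traversing the symbol set over per-symbol position lists). Names are illustrative, positions are 0-indexed (which leaves the gaps \(|m-l|\) unchanged), and the maintained implementation is in the linked repository.

```python
import numpy as np
from itertools import product

def sgt_embedding(seq, symbols, kappa=1, length_sensitive=True):
    """Sketch of SGT in the style of Algorithm 2: W0 accumulates |Lambda_uv|
    and Wk accumulates the summed exp(-kappa*|m - l|) over the pair instances."""
    V = list(symbols)
    idx = {u: i for i, u in enumerate(V)}
    W0 = np.zeros((len(V), len(V)))
    Wk = np.zeros((len(V), len(V)))

    # GetSymbolPositions: {symbol: [positions of the symbol in seq]}
    positions = {u: [] for u in V}
    for l, s_l in enumerate(seq):
        positions[s_l].append(l)

    for u, v in product(V, V):                       # double nested (u, v) iteration
        for l, m in product(positions[u], positions[v]):
            if m > l:                                # feed-forward constraint
                W0[idx[u], idx[v]] += 1
                Wk[idx[u], idx[v]] += np.exp(-kappa * (m - l))

    if not length_sensitive:
        W0 = W0 / len(seq)                           # scale W^(0) by the sequence length

    # kappa-th root of the element-wise division W^(kappa) / W^(0)
    with np.errstate(divide="ignore", invalid="ignore"):
        psi = np.where(W0 > 0, (Wk / W0) ** (1.0 / kappa), 0.0)
    return psi                                       # |V| x |V| matrix; .flatten() to vectorize

print(np.round(sgt_embedding("BAABCCA", "ABC", kappa=1), 3))
```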

3.1 Complexity

The initialization and post-processing steps in both algorithms are \(O(|{\mathcal {V}}|^2)\) for a dense matrix implementation of the Ws. However, they are sparse if \(L<< |{\mathcal {V}}|\). In this case, these steps can be computed in \(O(L^2 \log {|{\mathcal {V}}|})\) or \(O(L^2)\).

The processing step of Algorithm 1 has double nested iterations along the length of a sequence with unit computations within; its complexity is \(O(L^{2})\). On the other hand, the processing in Algorithm 2 has double nesting along the symbol set, also with unit computations therein. Therefore, this part of the computation is \(O(|{\mathcal {V}}|^{2})\). Additionally, the helper function GetSymbolPositions is of order \(O(|{\mathcal {V}}|L)\). Therefore, the processing complexity is \(O(|{\mathcal {V}}|(L+|{\mathcal {V}}|))\). Thus, Algorithm 2 is more suitable if \(L>> |{\mathcal {V}}|\).

The processing computation of SGT algorithms on a data set \({\mathcal {S}}\) with n sequences will consequently be \(O(nL^{2})\) and \(O(n|{\mathcal {V}}|(L+|{\mathcal {V}}|))\) for Algorithms 1 and 2, respectively. However, the SGT embeddings of the sequences \(s\in {\mathcal {S}}\) are independent. Note that the input to the algorithms is only one sequence s. Therefore, the embeddings can be computed in parallel for the sequences in a data set \({\mathcal {S}}\). It will be shown in Sect. 5.3 that parallel computation significantly reduces the runtime by distributing the sequences on several worker nodes and computing their embeddings simultaneously.
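Because each sequence’s embedding depends only on that sequence, the per-sequence computations can be distributed. Below is a minimal sketch with Python’s multiprocessing (a distributed framework such as the Spark-based setup used in Sect. 5.3 follows the same pattern); sgt_embedding refers to the sketch above and is assumed to be defined at module level so it can be pickled for the workers.

```python
from functools import partial
from multiprocessing import Pool

def embed_corpus(sequences, symbols, kappa=1, workers=4, chunksize=64):
    """Compute the SGT embedding of every sequence in parallel; each worker
    processes an independent chunk of the data set."""
    embed = partial(sgt_embedding, symbols=symbols, kappa=kappa)
    with Pool(processes=workers) as pool:
        return pool.map(embed, sequences, chunksize=chunksize)

# embeddings = embed_corpus(S, symbols=V, kappa=1, workers=8)
```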

The size of a sequence embedding is \(O(|{\mathcal {V}}|^2)\) in the worst case of no sparsity, i.e., when every symbol \(u\in {\mathcal {V}}\) is present in s and each element in the embedding is non-zero. Otherwise, if the symbol set is large, typically only a fraction of the symbols are present in s. In such cases, a sparse representation of the embedding reduces the size by a sparsity fraction r to \(O((r|{\mathcal {V}}|)^2), r\in (0,1)\). Therefore, although the size is quadratic with respect to \(|{\mathcal {V}}|\), it is not an impediment.

Both the time complexity and size are independent of the tuning parameter \(\kappa \). \(\kappa \) adjusts the amount of short- and long-term dependencies captured in SGT embedding (refer to Sect. 2.4.1). The computation complexities being independent of \(\kappa \) gives a significant advantage to SGT. Unlike the existing methods, the short- and long-term dependencies to include can be tuned based on the problem and not restricted by computation limitations.

3.2 Parameter selection

SGT embedding has only one tuning parameter, \(\kappa \). A small value of \(\kappa \) allows longer-term dependencies to be captured in the embedding, and vice versa. In the implementation shown in this paper, \(\kappa \) is chosen from \(1,2,\ldots ,10\). Although fractional values can also be taken, SGT’s performance is insensitive to minor differences in \(\kappa \). Therefore, \(\kappa \) is chosen as an integer in this paper.

The optimal selection of \(\kappa \) depends on the problem at hand. If the end objective is building a supervised learning model, methods such as cross-validation can be used. For unsupervised learning, any goodness-of-fit criterion can be used for the selection. In cases of multiple parameter optimization, e.g., the number of clusters (say, \(n_{c}\)) and \(\kappa \) together in clustering, we can use a random search procedure. In such a procedure, we randomly initialize \(n_{c}\), compute the best \(\kappa \) based on some goodness-of-fit measure, then fix \(\kappa \) to find the best \(n_{c}\), and repeat until there is no change.
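A sketch of this alternating random-search procedure for clustering, using k-means with the Davies-Bouldin index (lower is better) as the goodness-of-fit measure; scikit-learn is assumed, sgt_embedding is the sketch from Sect. 3, and the search ranges are illustrative.

```python
import random
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

def db_index(X, n_c):
    """Goodness-of-fit: Davies-Bouldin index of a k-means fit (lower is better)."""
    labels = KMeans(n_clusters=n_c, n_init=10, random_state=0).fit_predict(X)
    return davies_bouldin_score(X, labels)

def embed_all(sequences, symbols, kappa):
    return np.array([sgt_embedding(s, symbols, kappa=kappa).flatten() for s in sequences])

def random_search(sequences, symbols, nc_range=range(2, 20), kappas=range(1, 11)):
    n_c = random.choice(list(nc_range))            # randomly initialize n_c
    kappa = None
    while True:
        # fix n_c, choose the best kappa by goodness-of-fit
        best_kappa = min(kappas, key=lambda k: db_index(embed_all(sequences, symbols, k), n_c))
        X = embed_all(sequences, symbols, best_kappa)
        # fix kappa, choose the best n_c
        best_nc = min(nc_range, key=lambda n: db_index(X, n))
        if (best_kappa, best_nc) == (kappa, n_c):  # repeat until there is no change
            return kappa, n_c
        kappa, n_c = best_kappa, best_nc
```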

4 Experimental analysis

SGT’s efficacy can be assessed based on its ability to find (dis)similarity between sequences. Therefore, we built a sequence clustering experimental setup. Clustering requires an accurate computation of (dis)similarity between objects and thus is a good choice for the efficacy test.

We show the following experiments here: (a) Exp-1: length-sensitive sequence problem, (b) Exp-2: length-insensitive with non-parametric sequence pattern, (c) Exp-3: length-insensitive with parametric sequence pattern, and (d) Exp-4: sensitivity analysis against the sample size, symbol set size, and sequence length.

The settings for each of them are given in Table 1. The table shows the mean and standard deviation of the lengths of the sequences generated for each simulation in each experiment. For Exp-1, 2, and 4, a sequence is generated from a randomly simulated set of motifs (strings) of random lengths (between 2 and 8). These motifs are randomly placed and interspersed with arbitrary strings, which constitute the noise in a sequence. For reproduction of the experiments, the details of the sequence simulation are in “Appendix D”. In Exp-3, sequences are generated from a mixture of Markov and semi-Markov processes as presented in Ferreira and Pacheco (2005), and a mixture of hidden Markov processes from Helske and Helske (2017). In all the experiments, k-means clustering was applied to the SGT representations of the sequences. Besides, the publicly available implementations of the benchmark methods were used.

Table 1 Experimentation settings

In Exp-1, we compared SGT with length-sensitive algorithms, viz. MUSCLE, UCLUST, and CD-HIT, which are popular in Bioinformatics. These methods are hierarchical in nature, and thus, they find the optimal number of clusters. For SGT-clustering, the number of clusters is found using the procedure recommended in Sect. 3.2.

Figure 5 shows the results, where the y-axis is the ratio of the estimated best number of clusters, \({\hat{n}}_{c}\), and the true number, \(n_{c}\). The x-axis shows the clustering accuracy. For a best-performing algorithm, both metrics should be close to 1. As shown in the figure, CD-HIT and UCLUST overestimated the number of clusters by factors of about two and five, respectively. MUSCLE had a better \(n_{c}\) estimate but only about 95% accuracy. On the other hand, SGT accurately estimated \(n_{c}\) and has 100% clustering accuracy.

Fig. 5: Exp-1 results

Fig. 6: Exp-2 results

In Exp-2, we compared SGT with popular and state-of-the-art sequence analysis techniques, viz. n-gram, String Kernel, mixture hidden Markov model (HMM), Markov model (MM), and semi-Markov model (SMM)-based clustering. For n-gram, we take \(n=\{1,2,3\}\) and their combinations. For String Kernel, the subsequence length parameter \(k=4\) is taken. For these methods, we provided the known \(n_{c}\) to the algorithms and report the F1-score for accuracy. In this experiment, the clusters are increasingly overlapped to make them difficult to separate. Overlapping clusters imply that their seed motif sets have a non-null intersection (see “Appendix D”).

Exp-2’s results in Fig. 6a show the accuracy (F1-score), and the runtimes are in Fig. 6b; SGT is seen to outperform all others in accuracy. MM and SMM have poorer accuracy because of the first-order Markovian assumption. HMM is found to have comparable accuracy, but its runtime is more than six times that of SGT. The String Kernel and n-gram methods’ accuracies lie in between. Low-order n-grams have a smaller runtime than SGT but worse accuracy. Interestingly, the 1-gram method is better when the overlap is high, showing the higher-order n-grams’ inability to distinguish between sequences when the overlap is high.

Furthermore, we ran Exp-3 to see the performance of SGT on sequence data sets generated from mixtures of parametric distributions, viz. mixtures of HMM, MM, and SMM. The objective of this experiment is to test SGT’s efficacy on parametric data sets against parametric methods. In addition to obtaining data sets from mixed HMM and first-order mixed MM and SMM distributions, we also generate second-order Markov (MM2) and third-order Markov (MM3) data sets. Figure 7a shows the F1-scores and Fig. 7b the runtimes. As expected, the mixture clustering method corresponding to the true underlying distribution performs the best. Note that SMM is slightly better than MM in the MM setting because of its over-representative formulation, i.e., a higher-dimensional model that includes a variable time distribution. However, the proposed SGT’s accuracy is always close to the best. This shows SGT’s robustness to the underlying distribution and its universal applicability. And, again, its runtime is smaller than all others.

Fig. 7: Exp-3 results

Fig. 8: Exp-4 sensitivity analysis: sample size

Fig. 9: Exp-4 sensitivity analysis: symbol set size

Fig. 10: Exp-4 sensitivity analysis: sequence length

Sensitivity analysis of SGT against the sample size, symbol set size, and sequence length is performed in Exp-4. As shown in Table 1, the sample size, symbol set size, and the mean of the sequence lengths ranged over \(\{100, 500, 1000\}\), \(\{20, 50, 100\}\), and \(\{100, 500, 1000\}\), respectively. The standard deviation of the sequence lengths in this analysis was kept very small to accurately measure the effect of the length and is, therefore, not mentioned in the table. Besides, the settings for the sensitivity analysis do not take extreme values due to the computational limitations of some methods. For example, the 3-gram method failed to yield a result for a symbol set size of 100 on the test computing system.

The results are shown in Figs. 8, 9 and 10. As shown in the F1-score charts, SGT embedding-based k-means remained robust to variations in the sample size (Fig. 8a), symbol set size (Fig. 9a), and sequence length (Fig. 10a).

Among the benchmark methods, SMM is agnostic to the sequence length because it has additional parameterization that incorporates the sequence length. Moreover, the non-parametric n-gram based methods were better than the parametric methods for varying symbol set sizes, except that the higher-order n-grams were unable to yield results in tractable time for the larger feature spaces.

Besides, while SGT’s performance was unaffected by the sample size, the other methods improved with more samples. The runtime comparison shows that the SGT embedding runtime, although it increases with each variation, is significantly lower than that of the benchmark methods. Non-parametric methods generally have lower runtimes because the time-intensive feature estimation occurs only once. On the other hand, the iterative EM algorithm-based estimation in the parametric methods re-computes the features at every iteration.

5 Applications on real data

The sequence mining problem can be broadly categorized into classification, clustering, and search. In the following, real-world examples of each are presented. Performance comparisons with state-of-the-art methods, including deep learning, are made on the labeled data in sequence classification. Sequence clustering demonstrates an application of unsupervised learning. Moreover, a sequence search is demonstrated using the parallel computation capability of SGT. The implementation steps are available at https://github.com/cran2367/sgt.

5.1 Sequence classification

Here we perform classification on (a) protein sequences having either of two known functions, which act as the labels, and (b) network intrusion data containing audit logs, where any attack is the positive label.

The data set details are in Table 2. For both problems, we use the length-sensitive SGT. For proteins, it is due to their nature, while for network logs, the lengths are important because sequences with similar patterns but different lengths can have different labels. Consider a simple example of two sessions: {login, pswd, login, pswd,...} and {login, pswd,...(repeated several times)..., login, pswd}. While the first session can be a regular user mistyping the password once, the other session is possibly an attack to guess the password. Thus, the sequence lengths are as important as the patterns.

For the network intrusion data, the sparsity of the SGTs was high. Therefore, we performed principal component analysis (PCA) on them and kept the top 10 PCs as sequence features, which we call SGT-PC, for further modeling. For the proteins, the SGTs are used directly. An SVM classifier is trained on n-grams with an RBF kernel and cost parameter set to 1, and on the current state-of-the-art String Kernels (Kuksa et al. 2009), their faster approximate improvement by Farhan et al. (2017) (the k-mer and mismatch lengths are set to 5 and 2, respectively), and a recently developed state-of-the-art Random Features Kernel (Wu et al. 2019). Table 3 reports the average test accuracy (F1-score) from ten- and five-fold cross-validation for the protein and network data, respectively.
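A sketch of the SGT-PC pipeline just described for the network data (top 10 principal components of the vectorized SGT embeddings followed by an RBF-kernel SVM with cost 1, scored by cross-validated F1), assuming scikit-learn and the sgt_embedding sketch from Sect. 3; for the protein data the PCA step is simply dropped, and the data placeholders are not the actual data loaders.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# sequences, labels, symbols: placeholders for the network intrusion data;
# sgt_embedding is the length-sensitive sketch from Sect. 3.
X = np.array([sgt_embedding(s, symbols, kappa=1).flatten() for s in sequences])
y = np.array(labels)

# SGT-PC: top 10 principal components of the SGT embeddings, then an RBF SVM (C = 1)
sgt_pc_svm = make_pipeline(PCA(n_components=10), SVC(kernel="rbf", C=1.0))
print(cross_val_score(sgt_pc_svm, X, y, cv=5, scoring="f1").mean())  # five-fold F1
```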

Table 2 Data set attributes
Table 3 Classification accuracy (F1-score) results
Table 4 Deep learning

As we can see in Table 3, the F1-scores are high for all methods in the protein data, with SGT-based SVM surpassing all others. On the other hand, the accuracies are small for the network intrusion data. This is primarily due to (a) a small data set but high dimension (related to the symbol set size), leading to a weak predictive ability of models, and (b) a few positive class examples (unbalanced data) causing a poor recall rate. Still, SGT outperforms other methods by a significant margin.

Furthermore, LSTM models in deep learning are state-of-the-art for sequence classification. We compare them with a regular FNN deep learning model (an MLP, specifically) in which the SGT features are used as the embedding layer. For learning, we use the binary cross-entropy loss function and the adam optimizer. TensorFlow is used for the implementations.
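A sketch of the SGT-powered FNN referenced here, using TensorFlow/Keras with the binary cross-entropy loss and adam optimizer mentioned above; the hidden-layer width and training settings are illustrative, not the exact configuration used in the experiments.

```python
import tensorflow as tf

def build_fnn(input_dim, hidden_units=64):
    """Single hidden-layer FNN (MLP) that takes the vectorized SGT embedding as input."""
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(hidden_units, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# model = build_fnn(input_dim=X.shape[1])
# model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)
```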

Table 4a, b shows the results. The accuracies (F1-scores) on the protein data are close to 1 for all models. The SGT-powered FNN is only marginally better in accuracy; however, its runtime is a fraction of the LSTM’s runtime. On the network data, the accuracies are lower for the LSTM models. This can be because (a) it is a length-sensitive sequence problem, and (b) the sequence lengths vary significantly. LSTMs may not capture the differences due to lengths. Also, LSTMs pad all sequences to equal lengths, which may not work as effectively if the differences in lengths are significantly high (LSTMs still performed well on the protein data because the differences in the lengths are quite small). On the other hand, the FNN worked reasonably well. Its accuracy is higher than that of the SVM on the other methods but smaller than that of the SVM on SGT. This can be due to the small data set, which makes model training more difficult for a deep learning model. For the same reason, a single-layer FNN worked better than a two-layer one.

5.2 Sequence clustering

We cluster user activity on the web (weblog sequences) to understand user behavior.

We took users’ navigation data (weblogs) on msnbc.com collected during a 24-h period. The symbols of these sequences are the events corresponding to a user’s page request, e.g., frontpage, tech, etc. There are a total of 12 types of events (\(|{\mathcal {V}}| = 12\)). The data set is a random sample of 100,000 sequences. The sequences’ average length and standard deviation are 6.9 and 27.3, respectively, with lengths ranging from 2 to 7440 and a skewed distribution.

Our objective is to cluster the users with similar navigation patterns, irrespective of differences in their session lengths. We therefore take the length-insensitive SGT and use the random search procedure for optimal clustering from Sect. 3.2. We performed k-means clustering with the DB-index as the goodness-of-fit criterion and found the optimum at \(\kappa = 9\) with \(n_{c} = 104\) clusters, which is close to the result in Cadez et al. (2003). The frequency distribution (Fig. 11) of the number of members in each cluster has a long tail: the majority of users belong to a small set of clusters.
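The following minimal sketch illustrates this selection loop with scikit-learn's k-means and DB-index. The embedding matrix X is a random stand-in for the length-insensitive SGTs, and the candidate range of \(n_{c}\) is an assumption for illustration; in the full procedure of Sect. 3.2, \(\kappa \) is searched jointly as well.

    # Sketch of choosing the number of clusters by the DB-index (lower is better).
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    X = np.random.rand(500, 144)          # stand-in for |V|^2 = 144 SGT features per session

    best = None
    for n_c in range(2, 21):              # illustrative candidate range for n_c
        km = KMeans(n_clusters=n_c, n_init=10, random_state=0).fit(X)
        db = davies_bouldin_score(X, km.labels_)
        if best is None or db < best[1]:
            best = (n_c, db)
    print("selected n_c = %d (DB-index = %.3f)" % best)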

Fig. 11 Clustering

Fig. 12 Graphical visualization of cluster centroids

Additionally, SGT enables a visual interpretation of the clusters. In Fig. 12a, b, we show graph visualizations of some clusters’ centroids (which lie in the SGT space), each representative of a behavior.

5.3 Parallel computation and sequence search

Sequence databases found in the real world are typically quite large. For example, protein databases contain millions of sequences and are growing. Here we show that SGT sequence features enable a fast and accurate sequence search. More specifically, we utilize the parallel computation capability that SGT makes possible.

Table 5 Parallel computing cluster specifications

We collected samples of 10k and 1M protein sequences from the UniProtKB database at www.uniprot.org. First, we ran a benchmark test for runtime comparison: we computed the SGT embeddings for the two data sets in the default mode and in the parallel computation mode. In the default mode, the embeddings are computed one by one, so the computation time is proportional to the sample size.

In the parallel computation mode, a data set is partitioned into smaller chunks and distributed over several worker nodes. The embeddings are computed in parallel on these worker nodes. Depending on the number of worker nodes and their capacity, the data set can be repartitioned into more chunks and the overall runtime reduced further.
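As an illustration, a minimal PySpark sketch of this mode is given below; sgt_embed is a hypothetical stand-in for the embedding function, the sequences are toy placeholders, and the chunk count of 500 mirrors the 10k-sequence run reported below.

    # Sketch of the parallel mode: partition the corpus and embed on the workers.
    from pyspark.sql import SparkSession

    def sgt_embed(seq):
        return [len(seq)]                          # stand-in; replace with the real SGT

    spark = SparkSession.builder.appName("sgt-parallel").getOrCreate()
    sequences = ["ABCAB", "BBAC", "CABCA"] * 1000  # toy stand-ins for protein sequences

    rdd = spark.sparkContext.parallelize(sequences, numSlices=500)   # 500 chunks
    embeddings = rdd.map(sgt_embed).collect()                        # computed on the workers
    print(len(embeddings))
    spark.stop()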

The specifications of the parallel computing cluster are shown in Table 5. The cluster was hosted on AWS; the driver and worker nodes were r4.xlarge instances.

The runtimes for the SGT embedding computation are presented in Table 6. As shown in the table, the runtime for the 10k data set dropped from 13.5 min in the default mode to about 30 s with parallel computation; in this run, the data set was partitioned into 500 chunks. The runtime reduction is more significant for the 1M data set: here the data set was partitioned into 10k chunks, and the runtime dropped from more than 24 h to less than 30 min.

Table 6 Parallel computation runtime comparison
Table 7 Protein search query (A0A2T0PYE0) result

The embeddings are then stored, and a sequence query search is performed. A protein sequence (id: A0A2T0PYE0) is chosen arbitrarily; the objective is to find protein sequences in the data set that are similar to A0A2T0PYE0.

At this stage, the embeddings of the database sequences are known. To search for sequences similar to the query, we compute the embedding for A0A2T0PYE0 and take its dot product with each embedding in the database. The dot product is a measure of similarity; among the various choices of similarity measures, it is chosen here because it is computationally fast with embeddings.

The sequences with the largest dot products have the highest similarity to the query. The top five similar protein sequences from the data set are shown in Table 7, alongside the top five results from the commonly used protein search methods BLAST and CLUSTAL-Omega for reference.
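The following minimal sketch illustrates the dot-product search and top-five retrieval; the stored embeddings, their ids, and the query embedding are random stand-ins for the UniProtKB sample described above.

    # Sketch of the embedding-based query: dot products, then the five best matches.
    import numpy as np

    rng = np.random.default_rng(0)
    db_embeddings = rng.random((10_000, 400))        # stand-in for stored SGT embeddings
    db_ids = np.array([f"P{i:05d}" for i in range(10_000)])
    query_embedding = rng.random(400)                # stand-in for the query's SGT

    scores = db_embeddings @ query_embedding         # dot product = similarity
    top5 = db_ids[np.argsort(scores)[::-1][:5]]      # largest dot products first
    print(top5)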

Only two of the five SGT search results match those of BLAST and CLUSTAL-Omega, while the results of the latter two are more similar to each other. This is expected because both of the latter methods are alignment based, whereas SGT looks for similarity in the distribution of sequence symbol positions. Such a distribution-similarity-based search is well suited to user data, e.g., for behavioral analysis as presented in Sect. 5.2.

The search operation can be further improved in accuracy and speed by applying a dimension reduction method such as PCA and performing the dot product in the reduced dimension.

6 Discussion: why does SGT work better?

6.1 Ability to work in length-sensitive and length-insensitive problems

Fig. 13 Effect of sequence length on SGT features (\(\kappa = 5\))

We discussed SGT’s length-sensitive and length-insensitive variants in Sects. 2.3 and 2.4.1. Here we show SGT’s ability to work in both settings with a real example. Consider the three sequences, s1, s2, and s3, in Fig. 13, of lengths 6, 33, and 99, respectively. Their inherent pattern is \(\{\texttt {A}, \texttt {B},\texttt {C}\}\) occurring in succession.

We compute their SGT features for both the length-sensitive and length-insensitive variants and show them as adjacency matrices in Fig. 13. Looking first at the length-insensitive column, we notice that the SGT features of s1 and s2 are quite close, and as the sequence length increases the SGT features approach a constant value; the length-insensitive SGT features of s2 and s3 are the same up to two decimals.

On the other hand, in the length-sensitive case, the SGT features keep changing as the length changes. Note that the features decrease consistently as the length increases; as shown in Theorem 1, the expectation of the length-sensitive SGT features decreases with length. To prevent the features from approaching zero for long sequences, we can tune the hyperparameter \(\kappa \).

This example demonstrates that SGT can effectively account for both length sensitivity and length insensitivity. Moreover, the features derived from the SGT algorithm in Fig. 13 are approximately equal to the values computed from Eq. (4), confirming the theoretical interpretation in Sect. 2.4.1.

The sequences in this example were noise-free; however, SGT is robust to noise, as we discuss in Sect. 6.3.

6.2 Avoids false positives by inherently accounting for mismatches

In this paper, a false positive is defined as identifying two sequences of different lengths as similar merely because the shorter sequence is a subsequence of, or locally aligns with, the longer sequence. Avoiding false positives is a nontrivial challenge in length-insensitive sequence problems. For example, n-gram methods can often lead to such false positives. To address this, mismatch kernels were developed (Eskin et al. 2003). However, these methods require additional computation for the mismatches or substitutions, whereas SGT inherently accounts for mismatches.

Consider a small sequence \(\texttt {ABCABC}\) and compare it with \(\texttt {ABCABCABCABCABCABC}\) and ABCABC . Ideally, we require \(\texttt {ABCABC}\) to match the former but not the latter. As shown in Table 8, SGT feature comparison achieves this and thus avoids a false positive.

Table 8 SGT accounting for mismatches (\(\kappa = 5\))

6.3 Robust to noise

To explain SGT’s robustness to noise, we draw a parallel with a first-order Markov model. Suppose we are analyzing sequences in which “\(\texttt {B}\) occurs closely after \(\texttt {A}\).” Due to stochasticity, the observed sequences can look like (a) \(\texttt {ABCDEAB}\) and (b) \(\texttt {ABCDEAXB}\), where (b) is the same as (a) but with a noise symbol \(\texttt {X}\) appearing between \(\texttt {A}\) and \(\texttt {B}\). While the transition probability \(P(A \rightarrow B)\) under a Markov model differs significantly between (a) and (b) (a: 1.00, b: 0.50), SGT is robust to such noise: the SGT features for (\(\texttt {A}\),\(\texttt {B}\)) in the length-sensitive and length-insensitive scenarios are (a: 0.50, b: 0.45) and (a: 0.34, b: 0.30), respectively, for \(\kappa = 5\). As shown in Fig. 14, the percentage change in the SGT feature for (\(\texttt {A}\),\(\texttt {B}\)) in this case is smaller than for the Markov model and decreases with increasing \(\kappa \). This also shows that we can easily regulate the effect of such stochasticity by changing \(\kappa \): choose a high \(\kappa \) to reduce the noise effect, with the caution that the interspersed symbols may not always be noise but part of the sequence’s pattern (thus, we should not set \(\kappa \) to a high value without validation). Furthermore, a Markov model cannot distinguish between the two sequences \(\texttt {ABCDEAB}\) and \(\texttt {ABCDEFGHIJAB}\) from the (\(\texttt {A}\),\(\texttt {B}\)) transition probabilities (\(=1\) for both). In contrast, the SGT feature for (\(\texttt {A}\),\(\texttt {B}\)) changes from 1.72 to 2.94 (\(\kappa = 1\)) because SGT looks at the overall pattern.
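For concreteness, the sketch below computes the (\(\texttt {A}\),\(\texttt {B}\)) feature for these example sequences, assuming \(\phi _{\kappa }(d)=e^{-\kappa d}\), the ordered pair set \(\varLambda _{uv}\), and the normalizations \(|\varLambda _{uv}|\) (length-insensitive) and \(|\varLambda _{uv}|/L^{(s)}\) (length-sensitive) followed by the \(1/\kappa \) standardization of Sect. 2.3; under these assumptions it reproduces the values quoted above up to rounding.

    # Sketch of the pairwise SGT feature psi_{AB} under the assumed definitions.
    import itertools
    import numpy as np

    def sgt_pair_feature(seq, u, v, kappa, length_sensitive):
        L = len(seq)
        pairs = [(l, m) for l, m in itertools.product(range(L), range(L))
                 if l < m and seq[l] == u and seq[m] == v]
        total = sum(np.exp(-kappa * (m - l)) for l, m in pairs)
        norm = len(pairs) / L if length_sensitive else len(pairs)
        return (total / norm) ** (1.0 / kappa)

    for s in ["ABCDEAB", "ABCDEAXB"]:                   # noise X between A and B
        print(s,
              round(sgt_pair_feature(s, "A", "B", 5, True), 2),    # 0.50, 0.45
              round(sgt_pair_feature(s, "A", "B", 5, False), 2))   # 0.34, 0.30
    for s in ["ABCDEAB", "ABCDEFGHIJAB"]:               # P(A->B) = 1 for both in a Markov model
        print(s, round(sgt_pair_feature(s, "A", "B", 1, True), 2)) # 1.72, 2.94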

Fig. 14 Percentage change in the SGT feature for (\(\texttt {A}\),\(\texttt {B}\)) with \(\kappa \) in the presence of noise

7 SGT extensions

7.1 SGT for bidirectional sequences

As mentioned in Sect. 2.1, in a bidirectional sequence the chronological order of events does not matter, e.g., proteins. This is also the case with weblog data, such as music listening history, where we sometimes want to understand which tracks were listened to together rather than the order in which they were played.

SGT for bidirectional sequences can be computed by just changing the definition of \(\varLambda _{uv}(s)\) in Eq. (2) to,

$$\begin{aligned} {\tilde{\varLambda }}_{uv}(s) = \{(l,m):\, s_{l}=u,\, s_{m}=v,\; l,m\in 1,\ldots ,L^{(s)}\} \end{aligned}$$
(10)

Below we show that, under the assumption that all the symbols appear in a sequence with uniform probability, the bidirectional SGT features can be approximated directly from the directed ones.

We can write Eq. (10) as,

$$\begin{aligned} {\tilde{\varLambda }}_{uv}(s) &= \{(l,m):\, s_{l}=u,\, s_{m}=v,\; l,m\in 1,\ldots ,L^{(s)}\}\\ &= \{(l,m):\, s_{l}=u,\, s_{m}=v,\; l<m,\; l,m\in 1,\ldots ,L^{(s)}\}\\ &\quad + \{(l,m):\, s_{l}=u,\, s_{m}=v,\; l>m,\; l,m\in 1,\ldots ,L^{(s)}\}\\ &= \varLambda _{uv}(s)+\varLambda _{uv}^{T}(s) \end{aligned}$$

Therefore, the SGT for the bidirectional sequence in Eq. (3a) can be expressed as,

$$\begin{aligned} {\tilde{\varPsi }}_{uv}(s) &= \frac{\sum _{\forall (l,m)\in {\tilde{\varLambda }}_{uv}(s)}\phi _{\kappa }(d(l,m))}{|{\tilde{\varLambda }}_{uv}(s)|}\\ &= \frac{\sum _{\forall (l,m)\in \varLambda _{uv}(s)}\phi _{\kappa }(d(l,m))+\sum _{\forall (l,m)\in \varLambda _{uv}^{T}(s)}\phi _{\kappa }(d(l,m))}{|\varLambda _{uv}(s)|+|\varLambda _{uv}^{T}(s)|}\\ &= \frac{|\varLambda _{uv}(s)|\,\varPsi _{uv}(s)+|\varLambda _{uv}^{T}(s)|\,\varPsi _{uv}^{T}(s)}{|\varLambda _{uv}(s)|+|\varLambda _{uv}^{T}(s)|} \end{aligned}$$

Under the above assumption,

$$\begin{aligned} \varLambda _{uv}(s) \sim \varLambda _{uv}^{T}(s) \end{aligned}$$
(11)

Therefore, the bidirectional SGT features can be approximated as

$$\begin{aligned} {\tilde{\varPsi }}\sim \frac{\varPsi +\varPsi ^{T}}{2} \end{aligned}$$
(12)
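A minimal sketch of Eq. (12) applied to a directed SGT matrix (a random stand-in here, with \(|{\mathcal {V}}| = 12\) assumed for illustration):

    # Sketch of approximating the bidirectional SGT from a directed SGT matrix, Eq. (12).
    import numpy as np

    psi = np.random.rand(12, 12)                  # stand-in directed SGT
    psi_bidirectional = (psi + psi.T) / 2.0       # Eq. (12)
    print(np.allclose(psi_bidirectional, psi_bidirectional.T))   # symmetric: True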

7.2 SGT for symbol clustering

Fig. 15 Illustrative example for symbol clustering

Node clustering in graphs is a classical problem solved by various techniques, including spectral clustering and graph partitioning. SGT’s graph interpretation facilitates the grouping of symbols that occur closely together via any of these node clustering methods.

This is because SGT gives larger weights to the edges, \(\psi _{uv}\), corresponding to symbol pairs that occur closely together. For instance, consider the sequence in Fig. 15a, in which v occurs closer to u than w does, implying \(E[X]<E[Y]\). Therefore, in this sequence’s SGT, the edge weight for \(u \rightarrow v\) should be greater than for \(u \rightarrow w\), i.e., \(\psi _{uv}>\psi _{uw}\).

From the assumption in Sect. 2.4.1, we have \(E[|\varLambda _{uv}|]=E[|\varLambda _{uw}|]\). Therefore, \(\psi _{uv}\propto E[\phi (X)]\) and \(\psi _{uw}\propto E[\phi (Y)]\), and by Condition (b) on \(\phi \) given in Sect. 2.3, if \(E[X]<E[Y]\), then \(E[\psi _{uv}]>E[\psi _{uw}]\).

Moreover, for effective clustering, it is important to bring symbols that are “closer” in the sequence closer together in the graph space. In SGT’s graph interpretation, this means \(\psi _{uv}\) should be as high as possible to bring v closer to u in the graph, and vice versa for (u, w). Thus, effectively, \(\varDelta =E[\psi _{uv}-\psi _{uw}]\) should be increased. It is proved in “Appendix B” that \(\varDelta \) increases with \(\kappa \) if \(\kappa d>1, \forall d\), where \(d \in {\mathbb {N}}\).

In effect, SGT enables the clustering of associated symbols. This has real-world applications, such as finding the webpages (or products) that are viewed (or bought) together.

Fig. 16 Node clustering experiment result

7.2.1 Validation

We validated the efficacy of the SGT extensions given above (Sects. 7.1, 7.2) in another experiment, presented here. Our main aim in this validation is to perform symbol clustering assuming the sequences are bidirectional. We set up a test experiment such that, across different sequence clusters, some symbols occur closer to each other. We create a data set that has sequences from three clusters and symbols belonging to two clusters (symbols A–H in one cluster and I–P in the other). The mean and standard deviation of the simulated sequence lengths are 103.9 and 33.6, respectively. The noise level is 30–50%, and the number of underlying sequence clusters is three.

This emulates a biclustering scenario in which sequences in different clusters have distinct patterns; however, the pattern of closely occurring symbols is common across all sequences. This is a complex scenario in which clustering both sequences and symbols can be challenging.

Upon clustering the sequences, the F1-score is found to be 1.0. For symbol clustering, we applied spectral clustering to the aggregated SGT of all sequences, which yielded an accurate result with only one symbol mis-clustered. Moreover, the heat map in Fig. 16 clearly shows that symbols within the same underlying cluster have significantly higher associations. This validates that SGT can accurately cluster symbols along with clustering the sequences.
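A minimal sketch of this symbol-clustering step, using scikit-learn’s spectral clustering on a symmetrized aggregated SGT; the aggregated matrix here is a random stand-in for the one in the experiment, with the symbol labels A–P assumed.

    # Sketch of spectral clustering of symbols on a (symmetrized) aggregated SGT.
    import numpy as np
    from sklearn.cluster import SpectralClustering

    rng = np.random.default_rng(0)
    aggregated_sgt = rng.random((16, 16))                   # stand-in, symbols A-P
    affinity = (aggregated_sgt + aggregated_sgt.T) / 2.0    # symmetric affinity matrix
    labels = SpectralClustering(n_clusters=2,
                                affinity="precomputed",
                                random_state=0).fit_predict(affinity)
    print(dict(zip("ABCDEFGHIJKLMNOP", labels)))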

8 Conclusion

SGT is found to yield superior accuracy over other methods in sequence mining. This can be attributed to SGT (a) effectively capturing short- and long-term patterns in both length-sensitive and length-insensitive problems, (b) inherently accounting for sequence mismatches to avoid false positives, and (c) being robust to noise in sequence patterns. These attributes are discussed in detail in Sects. 6.1–6.3. Moreover, SGT has significantly lower runtimes due to its computational efficiency and is also easy to implement. It can be further improved by implementing a sparse data structure for the W matrices in the algorithms. In addition to the above-mentioned applications, SGT can also be used for (a) element (symbol) clustering, (b) sequence database search, and (c) sequence encoding, as shown in the applications sections.

Besides, some preliminary work shows the possibility of a 2-D SGT applicable to image data, allowing invariance to orientation. Moreover, other choices for the function \(\phi \), such as a Gaussian, the addition of a skip parameter r to address lag effects, for example, \(e^{-\kappa \max {(d-r, 0)}}\), and the application of concatenated (stacked) SGT features for different \(\kappa \) may be taken up as future research. Furthermore, a formal approach for \(\kappa \) selection can be developed.