
1 Introduction

Recently, many state-of-the-art time series classification methods have used Artificial Neural Networks (ANN), such as Recurrent Neural Networks (RNN) [29] and Temporal Convolutional Neural Networks (TCNN) [38]. These networks have been shown to be widely effective for time series and signal classification [3, 38]. However, while neural networks can be effective for time series, they have no inherent mechanism for capturing certain elements of signals, such as motifs and discords.

Time series motifs are repeated subsequences within a time series, and discords are anomalous subsequences. Finding motifs and discords, the field of motif discovery, is essential for finding patterns in time series. Motif discovery has been used for time series analysis in many domains, such as protein sequences [22, 41], actions [36], sounds [9], and signals [23, 25].

Fig. 1. Example of a time series and the resulting Matrix Profile. The green brackets of (a) and (b) are an example of a motif, and the orange brackets of (c) and (d) are an example of a discord. (Color figure online)

One powerful tool used for motif discovery is Matrix Profile [40]. Matrix Profile is a robust and scalable data structure that helps with motif discovery. Specifically, Matrix Profile is a sequence calculated based on a sliding window of subsequences and the distance of each subsequence to its nearest neighbor subsequence. An example time series and its Matrix Profile are shown in Fig. 1. In the figure, the dips in the Matrix Profile correspond to the locations of the motifs and the peaks correspond to discords. The use of Matrix Profile has been shown to be effective at large-scale motif discovery [39, 42].
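To make this concrete, the minimal sketch below computes a Matrix Profile and locates a motif and a discord as in Fig. 1. It uses the open-source stumpy library, which is one common implementation; it is an assumption here, not necessarily the tooling used in this paper.

```python
# Minimal sketch: a Matrix Profile via the stumpy library (an assumption,
# not this paper's released code).
import numpy as np
import stumpy

rng = np.random.default_rng(0)
t = rng.standard_normal(1000)          # a placeholder univariate time series
M = 70                                  # subsequence (window) length

mp = stumpy.stump(t, M)                 # column 0 holds the profile distances
profile = mp[:, 0].astype(float)        # length N - M + 1

motif_idx = int(np.argmin(profile))     # dip  -> best motif location
discord_idx = int(np.argmax(profile))   # peak -> most anomalous subsequence
print(motif_idx, discord_idx)
```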

We propose the inclusion of motif and discord information as supplemental information for signal classification. Namely, Matrix Profile is used to improve the classification ability of temporal neural networks by providing additional motif-based features alongside the original signal features. This is done by considering the Matrix Profile vector as a sequence and combining it with the original time series features in fusion neural networks. These motif-based features can be considered a self-augmented extra modality to represent the signal. Therefore, we are able to use both features in a single multi-modal model. Through this, we demonstrate that the motif-based features can supplement the original time series features and improve classification.

The contribution of this paper is as follows:

  • We propose the use of Matrix Profile-based features to supplement time series in classification. This is done using fusion networks to combine the features.

  • We demonstrate that the proposed method can improve the accuracy of neural networks in signal classification. To do this, we evaluate the proposed method on 24 datasets taken from the 2018 University of California Riverside (UCR) Time Series Archive [7]. The 24 datasets are all of the sensor and device datasets with at least 100 training patterns from the archive.

  • We examine the effect that the window size of Matrix Profile has on the accuracy of the proposed method.

  • The code for the proposed method can be found at

    https://github.com/uchidalab/motif-based-features

2 Related Work

The use of fusion neural networks is a common solution for multi-modal data recognition [11], and they have been used for a wide range of applications. Among them, a few works use fusion neural networks with different features extracted from the same time series. For example, the Long Short-Term Memory Fully Convolutional Network (LSTM-FCN) [17] uses a fusion network to combine an LSTM branch and an FCN branch for time series classification. Similarly, Song et al. [34] combine the features from an LSTM and a CNN for time series retrieval. Features can also be derived or learned from the original time series representation. For example, Iwana et al. [16] propose using local distance-based features with the original features in fusion 1D CNNs, and Oba et al. [24] combine data augmentation methods in a gated fusion network. Matsuo et al. [20] use a learned self-augmentation by converting the time series into images and then using a multi-modal network. Wang et al. [37] use a fusion network with multi-scale temporal features and distance features.

3 Using Matrix Profile as a Feature Extraction Method

3.1 Motif Discovery

A motif is a repeated pattern in a time series. Specifically, given time series \(\textbf{t} = t_1, \dots , t_n, \dots , t_N\) of length N and \(t_n \in \mathbb {R}\), a continuous subsequence \(\textbf{t}_{s,M}=t_s, \dots , t_{s+M-1}\) of length M starting from position s, where \(1 \le s \le N-M+1\), is a motif if it shares similar values with another subsequence \(\textbf{t}_{s',M}\) within \(\textbf{t}\) with a different start position \(s'\). Note that time series element \(t_n\) can have one dimension (univariate) or multiple dimensions (multivariate).

Motif discovery refers to finding sets of similar short sequences in a large time series dataset. Motifs are essential as these primitive patterns can be used as inputs for algorithms to perform segmentation, classification, anomaly detection, etc. Further, studying motifs can provide insight into the functional properties of the time series [43].

3.2 Matrix Profile

Matrix Profile [40] is a powerful motif discovery algorithm that represents a time series based on the distances of its subsequences to their nearest neighbors. Specifically, using a sliding window, we extract all of the subsequences from the time series. We then compute the pairwise distances between these subsequences and store them in the form of a matrix. This matrix is then reduced to a vector that only holds the distance of each subsequence to its nearest neighbor. This vector is called the Matrix Profile.

Namely, given time series \(\textbf{t}\), first the all-subsequences set is created. The all-subsequences set \(\mathcal {A}\) is an ordered set of all possible subsequences of time series \(\textbf{t}\), obtained by sliding a window of length M across \(\textbf{t}\), where M is a user-defined subsequence length. This yields \(N-M+1\) subsequences, and we use \(\mathcal {A}[s]\) to denote the subsequence \(\textbf{t}_{s,M}=t_s, \dots , t_{s+M-1}\).
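As a small illustration (the function and variable names are ours, not from the released code), the all-subsequences set \(\mathcal {A}\) can be built with a sliding window:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def all_subsequences(t: np.ndarray, M: int) -> np.ndarray:
    """All-subsequences set A: one length-M subsequence per row."""
    return sliding_window_view(t, M)     # shape: (N - M + 1, M)

t = np.arange(10.0)                      # toy series with N = 10
A = all_subsequences(t, M=4)             # A[s] corresponds to t_{s,M} (0-indexed here)
```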

Next, a Distance Profile \(\textbf{d}_s\) is created for each subsequence in \(\mathcal {A}\). The Distance Profile \(\textbf{d}_s\) is the ordered vector of distances between subsequence \(\mathcal {A}[s]\) and all other subsequences in \(\mathcal {A}\). For this distance, traditionally, the Euclidean distance is used. Using each Distance Profile, a similarity join set \(\mathcal {S}\) is constructed by pairing each subsequence in \(\mathcal {A}\) with its nearest neighbor,

$$\begin{aligned} \mathcal {S}_s = \mathcal {A}\left[ \mathop {\mathrm {arg\,min}}\limits _{s' \ne s} \textbf{d}_s[s'] \right] , \end{aligned}$$
(1)

for each s-th subsequence of \(\mathcal {A}\), where the condition \(s' \ne s\) excludes the trivial match of a subsequence with itself (in practice, an exclusion zone around s is used [40]). Matrix Profile \(\textbf{p}\) is the vector of distances between each subsequence in \(\mathcal {A}\) and its nearest neighbor in \(\mathcal {S}\), or:

$$\begin{aligned} \textbf{p} = ||\mathcal {A}_{1} - \mathcal {S}_{1}||, \dots , ||\mathcal {A}_{s} - \mathcal {S}_{s}||, \dots , ||\mathcal {A}_{N-M+1} - \mathcal {S}_{N-M+1}||. \end{aligned}$$
(2)

An example of the result of a Matrix Profile calculation is shown in Fig. 1. Matrix Profile has many advantages over conventional methods of motif discovery and anomaly detection representations. To state a few, it is space efficient, parallelizable, scalable, and can be used efficiently on streams of data [40].
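For clarity, the following is a direct, unoptimized rendering of Eqs. (1) and (2); it assumes the z-normalized Euclidean distance used by the standard Matrix Profile and a simple exclusion zone for trivial matches. Scalable algorithms such as STAMP/STOMP [40] compute the same vector far more efficiently; this sketch is for exposition only.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def naive_matrix_profile(t: np.ndarray, M: int) -> np.ndarray:
    """O(N^2 M) reference implementation of Eqs. (1)-(2)."""
    A = sliding_window_view(t, M).astype(float)    # all-subsequences set
    # z-normalize each subsequence, as in the standard Matrix Profile
    A = (A - A.mean(axis=1, keepdims=True)) / (A.std(axis=1, keepdims=True) + 1e-8)
    n_sub = A.shape[0]
    excl = max(1, M // 2)                          # exclusion zone around s
    p = np.empty(n_sub)
    for s in range(n_sub):
        d = np.linalg.norm(A - A[s], axis=1)       # Distance Profile d_s
        lo, hi = max(0, s - excl), min(n_sub, s + excl + 1)
        d[lo:hi] = np.inf                          # ignore trivial self-matches
        p[s] = d.min()                             # distance to the nearest neighbor
    return p                                       # the Matrix Profile, length N - M + 1
```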

3.3 Matrix Profile as a Motif-Based Feature

As described previously, Matrix Profile is a vector that identifies motifs and discords by storing the distance of each subsequence to its nearest neighbor. In other words, the values of the Matrix Profile are small for repeated subsequences and large for anomalies. These values have a nonlinear relationship to the original time series features. Thus, it is possible to exploit Matrix Profile to create a feature vector that contains information that is not inherent to the original time series features.

In order to use the Matrix Profile features, we consider vector \(\textbf{p}\) as a sequence, or:

$$\begin{aligned} \textbf{f}=\textbf{p}^\top =p_1, \dots , p_s, \dots , p_{N-M+1}. \end{aligned}$$
(3)

This gives a sequence \(\textbf{f}\) of length \(N-M+1\), which is similar in size to the original time series features \(\textbf{t}\). The motif feature sequence \(\textbf{f}\) can now be used alongside the original \(\textbf{t}\) in multi-modal classification.

4 Multi-modal Classification with Fusion Neural Networks

The original features \(\textbf{t}\) and the motif features \(\textbf{f}\) are different modalities that contain different information about the same signal. Therefore, we combine the two features in one multi-modal model to improve classification. Using the additional motif features can supplement the original time series.

Fig. 2. Different arrangements of combining features in multi-modal fusion networks.

There are various methods of creating a multi-modal classification model. We propose to use a fusion neural network. Specifically, we implement a multi-modal neural network and combine the modalities through model fusion. The modalities can be combined at different points in the neural network. Two common places where modality fusion can take place are the input level and the feature level, as shown in Fig. 2. Input-level fusion combines the inputs by concatenating them; the combined input is then used with a typical temporal neural network. Feature-level fusion concatenates separate modality branches within a network.
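As a hypothetical sketch of the two arrangements in Fig. 2: the actual backbones used in Sect. 5.2 are a 1D VGG and a BLSTM, while the small CNN below is only a placeholder with made-up layer sizes.

```python
import torch
import torch.nn as nn

def conv_branch(in_ch: int) -> nn.Module:
    # A small 1D convolutional feature extractor (placeholder depth/widths).
    return nn.Sequential(
        nn.Conv1d(in_ch, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool1d(2),
        nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    )

class InputFusion(nn.Module):
    """Concatenate t and f along the channel dimension, then one network."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.net = conv_branch(in_ch=2)            # 2 channels: TS + MP
        self.head = nn.Linear(128, n_classes)

    def forward(self, ts, mp):                     # ts, mp: (batch, 1, length)
        x = torch.cat([ts, mp], dim=1)             # (batch, 2, length)
        return self.head(self.net(x))

class FeatureFusion(nn.Module):
    """Separate branch per modality; concatenate before the classifier."""
    def __init__(self, n_classes: int):
        super().__init__()
        self.ts_branch = conv_branch(in_ch=1)
        self.mp_branch = conv_branch(in_ch=1)
        self.head = nn.Linear(128 * 2, n_classes)

    def forward(self, ts, mp):
        z = torch.cat([self.ts_branch(ts), self.mp_branch(mp)], dim=1)
        return self.head(z)
```

Note that input-level fusion requires the two sequences to share the same length, which is one reason the motif features are zero-padded (Sect. 5.1).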

5 Experimental Results

5.1 Data

In order to evaluate the proposed method, we use 24 time series datasets from the UCR Time Series Archive [7]. The datasets are all of the device and sensor type datasets with at least 100 training patterns. There are 8 device datasets (ACSF1, Computers, ElectricDevices, LargeKitchenAppliances, PLAID, RefrigerationDevices, ScreenType, and SmallKitchenAppliances) and 16 sensor datasets (AllGestureWiimoteX, AllGestureWiimoteY, AllGestureWiimoteZ, ChlorineConcentration, Earthquakes, FordA, FordB, FreezerRegularTrain, GesturePebbleZ1, GesturePebbleZ2, InsectWingbeatSound, Phoneme, Plane, StarLightCurves, Trace, and Wafer). The sensor and device datasets are used because they are examples of signal data; however, there is no theoretical limitation on the type of time series used. Furthermore, datasets with fewer than 100 training patterns are not used because very small datasets are not suitable for neural networks.

The datasets used in the experiments have a wide range of lengths. The dataset with the shortest time series is the ElectricDevices dataset with 96 time steps, and the longest is the ACSF1 dataset with 1,460 time steps. Six of the datasets, AllGestureWiimoteX, AllGestureWiimoteY, AllGestureWiimoteZ, GesturePebbleZ1, GesturePebbleZ2, and PLAID, have a varying number of time steps. These datasets are pre-processed with post-pattern zero padding, i.e., zeros are appended after each pattern. In addition, all of the datasets except for the six previously mentioned were already z-normalized by the creators of the datasets.

For the motif-based features, the Matrix Profile algorithm is applied to each signal. We use a Matrix Profile window size of 7% of the longest time series in the training dataset. This window size is determined through a parameter search, shown in Sect. 5.7. Also, for the input-fusion networks, because the two feature sequences differ in length by \(M-1\), the motif features \(\textbf{f}\) are post zero-padded. Finally, the motif features are z-normalized based on the training set.
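Put together, this feature preparation might look like the following sketch, which reuses naive_matrix_profile from the Sect. 3.2 sketch; the 7% window, post zero-padding, and training-set z-normalization follow the text, while the helper names and the equal-length placeholder data are ours.

```python
import numpy as np

def motif_features(series: np.ndarray, M: int) -> np.ndarray:
    """Matrix Profile modality f for one series, post zero-padded to len(series)."""
    p = naive_matrix_profile(series, M)       # length N - M + 1 (Sect. 3.2 sketch)
    f = np.zeros(len(series))
    f[: len(p)] = p                           # pad the missing M - 1 values with zeros
    return f

# window size: 7% of the longest time series in the training set
train = [np.random.randn(300) for _ in range(100)]    # placeholder training signals
M = max(2, round(0.07 * max(len(t) for t in train)))

train_f = np.stack([motif_features(t, M) for t in train])
mu, sigma = train_f.mean(), train_f.std() + 1e-8      # statistics from the training set only
train_f = (train_f - mu) / sigma                      # z-normalize the motif features
```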

5.2 Architecture and Settings

Two time series recognition architectures were used as the foundation of the experiments: a 1D Very Deep Convolutional Network (VGG) [33] and a Bidirectional LSTM (BLSTM) [31]. The 1D VGG is a VGG adapted for time series in that it uses 1D convolutions instead of the standard 2D convolutions. It has multiple blocks of convolutional layers followed by max pooling layers, and two fully-connected layers with 1,024 nodes each and dropout with a probability of 0.5. The number of blocks and the number of filters per convolution follow the suggestions of Iwana and Uchida [15]. For the BLSTM, there are two layers of 100 cells each. The hyperparameters of the BLSTM follow the optimal settings suggested by Reimers et al. [28]. In the case of the feature-fusion networks, two streams with the same hyperparameters are used, and the concatenation is performed before the first fully-connected layer.

For training the 1D VGG, Stochastic Gradient Descent (SGD) with an initial learning rate of 0.01, momentum of 0.9, and weight decay of \(5\times 10^{-4}\) is used, as suggested by [33]. For the BLSTM, following [28], we use a Nadam [8] optimizer with an initial learning rate of 0.001. For both networks, we use a batch size of 50 and train for 10,000 iterations. The datasets used have fixed training and test sets that are provided by the dataset authors.
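In PyTorch terms, the reported optimization settings correspond to something like the sketch below; the two modules are only stand-ins for the actual architectures, and the released code may differ.

```python
import torch
import torch.nn as nn

vgg = nn.Conv1d(2, 64, kernel_size=3)                        # stand-in for the 1D VGG
blstm = nn.LSTM(2, 100, num_layers=2, bidirectional=True)    # stand-in for the BLSTM

# 1D VGG: SGD with the settings suggested in [33]
vgg_opt = torch.optim.SGD(vgg.parameters(), lr=0.01,
                          momentum=0.9, weight_decay=5e-4)

# BLSTM: Nadam with the initial learning rate suggested in [28]
blstm_opt = torch.optim.NAdam(blstm.parameters(), lr=0.001)
```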

5.3 Comparison Methods

To demonstrate the effectiveness of the proposed method, the following evaluations were performed:

  • Single-Modality Network with Time Series Features (Single, TS): The original time series features are used as a baseline.

  • Single-Modality Network with Matrix Profile Features (Single, MP): This uses the same networks as Single TS, but with only the Matrix Profile-based features. Single MP (7%) refers to using a Matrix Profile window of 7% of the maximum time series length, and Single MP (Best) uses the best window for each dataset.

  • Input-Level Fusion with Time Series and Matrix Profile Features (Input Fusion, TS+MP): For the input-level fusion network, the time series features and Matrix Profile features are concatenated in the dimension (channel) direction and then fed to the neural network.

  • Feature-Level Fusion with Time Series and Matrix Profile Features (Feature Fusion, TS+MP): For the feature-level fusion network, each modality has its own feature extraction branch, and the branches are concatenated before the first fully-connected layer.

The evaluation of all of the comparisons uses the same hyperparameters and training scheme, with the exception of the feature-level fusion, which has two modality streams with their own sets of feature extraction layers.

Table 1. Average Test Accuracy (%) over Five Training Runs

5.4 Results

Training and testing were performed five times, and the average of the five results was recorded as the final value in order to obtain an accurate representation of the accuracy of each method. The results are shown in Table 1. For the applicable comparisons, the window size was set to a fixed percentage of the longest time series in the training set. For the MP (7% Win.) features, all datasets use a window of 7% of the maximum time series length. For MP (Best Win.), the window with the best accuracy is used for each dataset.

From the table, it can be seen that input fusion with the time series features and the proposed Matrix Profile features had the highest accuracy for most datasets. The results with the best window for each dataset predictably had the highest accuracy. However, a window of 7% of the time series length still performed better than the networks without fusion. This is true for both the 1D VGG and the BLSTM.

Fig. 3. Critical difference diagram using a Nemenyi test comparing the proposed method to reported methods. Green highlighted methods are the proposed input fusion networks. (Color figure online)

We also compare the proposed method to other comparison methods. A Nemenyi test is performed using results reported in the literature. Figure 3 compares the proposed BLSTM and VGG using TS+MP (7%) and (Best) to Bag of Patterns (BoP) [18], Bag of Symbolic Fourier Approximation Symbols (BOSS) [30], Collective of Transformation Ensembles (COTE) [2], Complexity Invariant Distance (CID) [4], Derivative DTW (DD\(_{DTW}\)) [12], Derivative Transform Distance (DTD\(_C\)) [13], Elastic Ensemble (EE) [19], Fast Shapelets (FS) [26], Feature Fusion CNN using Local Distance Features and series features (CNN LDF+TS) [16], Learned Pattern Similarity (LPS) [6], Learned Shapelets (LS) [21], Multilayer Perceptron (MLP) [1], Random Forest (RandF) [1], Residual Network (ResNet) [10], Rotation Forest (RotF) [1], Shapelet Transform (ST) [14], Symbolic Aggregate Approximation - Vector Space Model (SAXVSM) [32], SVM with a linear kernel (SVML) [1], Time Series Bag of Features (TSBF) [5], SVM with a quadratic kernel (SVMQ) [1], 1-NN with Euclidean Distance (1-NN ED) [7], 1-NN with DTW (1-NN DTW) [7], 1-NN with DTW with the best warping window (1-NN DTW (Best)) [27], and 1-NN with Move-Split-Merge (1-NN MSM) [35]. In the figure, BLSTM and VGG refer to the previous models with only time series features. It can be seen that the proposed method performed toward the upper end of the compared methods. Using the best window with VGG ranked higher than most of the models. The methods with better overall scores were a large ensemble of many classifiers (COTE), a classical method, and two other neural networks.

5.5 Qualitative Analysis

Fig. 4. Examples of test patterns classified by Single VGG TS and the proposed Input Fusion VGG TS+MP (7%) from the RefrigerationDevices dataset. The upper row of each subfigure is the original time series feature and the lower row is the Matrix Profile feature. Two examples from each class are shown.

Figure 4 compares test samples classified by a standard VGG using the normal time series features and by the proposed Input Fusion network with a VGG backbone using both the time series features and the Matrix Profile features. Noticeably, the proposed method performed better when the Matrix Profile features had more discords. In Figs. 4(b) and (c), the Matrix Profile features were often small with only sparse and narrow peaks. Conversely, when the proposed method excelled, the discord peaks were wider and more frequent. Thus, it can be inferred that the proposed method is better suited to signals with more frequent discords.

5.6 Ablation

Table 2. Average Test Accuracy (%) of the 24 Datasets over Five Training Runs

An ablation study is performed to demonstrate the usefulness of adding the Matrix Profile-based features. Table 2 compares using a network on a single feature (time series features or Matrix Profile features) as well as the difference between using an input fusion network and a feature fusion network. Feature Fusion refers to a network that has branches for both features and concatenates them at the fully-connected layer. Also, in order to demonstrate that the improved results of the proposed method are not strictly due to having more parameters, a fair comparison is made using the time series features for both modalities of the fusion networks (TS+TS).

Table 2 shows that using a fusion network with only time series features does not differ significantly from using a single network; the results of Single TS and Input Fusion/Feature Fusion TS+TS are within a percent of each other. However, the TS+MP trials all show large improvements over Single TS. This indicates that the Matrix Profile-based features provide supplemental information for the network to learn from.

Fig. 5. Effect of the Matrix Profile window size.

5.7 Effect of Window Size

Matrix Profile has one hyperparameter, the window size. The value of this hyperparameter affects the robustness of the proposed method. Figure 5 shows the average accuracy at different window sizes. As can be seen in the figure, beyond a window size of about 7% of the time series length, the accuracy starts to decrease. Furthermore, when using the Matrix Profile features only, the accuracy of the VGG decreases quickly after a small window size. Despite this, the accuracy of the Input Fusion and Feature Fusion networks still increases.

6 Conclusion

In this paper, we propose the use of motif-based features to supplement time series in classification. The motif-based features represent motifs and discords in time series and are created by using Matrix Profile to generate a second modality of data representing the time series. Because the features are similar in length to the original time series, we can use them as a sequence in multi-modal neural networks.

Through the experiments, we demonstrate that using the motif-based features alongside the time series features in fusion networks can increase the accuracy of BLSTMs and temporal CNNs. We performed an extensive evaluation using all of the sensor and device datasets with at least 100 training patterns from the UCR Time Series Archive, a total of 24 time series datasets.