
1 Introduction

Multivariate time series classification plays a critical role in various fields, such as gesture recognition [29], disease diagnosis [15], and brain-computer interfaces [6]. Recent years have witnessed Transformer-based methods making remarkable breakthroughs in numerous disciplines, such as natural language processing [19, 24], computer vision [1, 9], and visual-audio speech recognition [20, 22]. This success has inspired an increasing application of Transformer-based architectures [4, 16, 30] to multivariate time series classification. Moreover, the Transformer's ability to perform parallel computation and leverage long-range dependencies in sequential data makes it especially suitable for modeling time series data [26].

Since the Transformer is position-insensitive, positional embedding was introduced to allow the model to learn the relative positions of tokens. Positional embedding injects position information into sequence data [23]. It takes the form of sinusoidal functions of different frequencies, with each embedding dimension corresponding to a sinusoid whose wavelengths form a geometric progression. To date, positional embedding has been routine for Transformer-based models that deal with multivariate time series classification problems.

Despite its widespread use, there have been debates [27, 31] about the necessity of positional embedding, and a comprehensive investigation of its effectiveness across Transformer-based models has yet to be conducted. Firstly, Transformer-based models [14, 33] that contain position-sensitive modules (e.g., convolutional and recurrent layers) can automatically learn position information, making positional embedding redundant to some extent. This point is supported by studies in other fields [18, 28] suggesting that positional embedding may be unnecessary and can be replaced with position-sensitive layers. Secondly, positional embedding has inherent limitations that may impair a classifier's performance. Because positional embedding is hand-crafted, it may introduce inductive bias that adversely impacts the model's performance in some cases. Moreover, since positional embedding injects the same position tokens into time series of different classes, it makes it harder for the classifier to distinguish sequences with different class labels.

In this paper, we explore the impact of positional embedding on various Transformer-based models to facilitate researchers and practitioners in making informed decisions on whether to incorporate positional embedding in their models for multivariate time series classification. In a nutshell, we make the following contributions in this paper:

  • We comprehensively review existing Transformer-based models that contain position-sensitive layers and summarize them into six types of Transformer-based variants.

  • We conduct extensive experiments on 30 multivariate time series classification datasets and evaluate the impact of positional embedding on the vanilla Transformer and Transformer-based variants.

  • Our results show that positional embedding positively impacts the performance of the vanilla Transformer while negatively influencing the performance of the Transformer-based variants in multivariate time series classification.

Fig. 1. A summary of Transformer-based variants for modeling sequential data.

2 Background

2.1 Positional Embedding

Positional embedding was first proposed for Transformer in [23], which uses fixed sine and cosine functions of different frequencies to represent the position information, as described below:

$$\begin{aligned} \begin{aligned} PE_{(pos, 2i)} &=\sin \left( pos / 10000^{2i / d_{\text{model}}}\right) \\ PE_{(pos, 2i+1)} &=\cos \left( pos / 10000^{2i / d_{\text{model}}}\right) \end{aligned} \end{aligned}$$
(1)

where pos and i are the position and the dimension indices, respectively, and \(d_{model}\) is the dimensionality of the input time series; each dimension of the positional embedding corresponds to a sinusoid. For any fixed offset k, \(PE_{pos+k}\) can be represented as a linear function of \(PE_{pos}\), which enables the model to learn relative positions easily. The positional embedding is then added to the time series as the input to the Transformer.
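As a concrete illustration, the following is a minimal PyTorch sketch of Eq. (1); the function name is ours, the implementation assumes an even \(d_{model}\), and it is not taken from any specific library.

```python
import math
import torch

def sinusoidal_positional_embedding(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal positional embedding as in Eq. (1); assumes even d_model."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)          # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))                      # 10000^{-2i/d_model}
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)   # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)   # odd dimensions
    return pe

# The embedding is added to an input series X of shape (T, d_model):
# X = X + sinusoidal_positional_embedding(T, d_model)
```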

Considering that hand-crafted positional embedding is generally less expressive and adaptive [26], the Time Series Transformer (TST) [32] enhances the vanilla Transformer [23] with learnable positional embedding. Specifically, TST shares the same architecture as the vanilla Transformer, stacking several basic blocks, each consisting of scaled dot-product multi-head attention and a feed-forward network (FFN), to leverage temporal data. It differs in that the positional embedding is initialized with fixed values and then updated jointly with the other model parameters during training.
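For comparison, a learnable positional embedding can be sketched as a trainable parameter added to the input; the class name and the uniform initialization below are assumptions of this sketch rather than TST's exact scheme.

```python
import torch
import torch.nn as nn

class LearnablePositionalEmbedding(nn.Module):
    """Positional embedding stored as a trainable parameter (TST-style sketch)."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        # Initialized here uniformly, then updated jointly with the other
        # model parameters during training.
        self.pe = nn.Parameter(torch.empty(max_len, d_model))
        nn.init.uniform_(self.pe, -0.02, 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]
```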

2.2 Transformer-Based Variants

Current studies often incorporate convolutional or recurrent layers into the vanilla Transformer architecture when dealing with sequence-related tasks, including time series analysis. Figure 1 summarizes these Transformer-based variants into six categories of representative methods, as detailed below.

  • Convolutional Embedding: Methods in this category, namely Informer [33], Tightly-Coupled Convolutional Transformer (TCCT) [21], and ETSformer [27], implement a convolutional layer to obtain convolutional embeddings, which map the raw input sequences to a latent space before feeding them to the transformer block.

  • Convolutional Attention: Instead of calculating point-wise attention, LogTrans [13] and Long-short Transformer [34] use convolutional layers to calculate the attention matrices (queries, keys, and values) over segments, thereby leveraging local temporal information.

  • Convolutional Feed-forward: Uni-TTS [17] and Conformer [8] implement a convolutional layer after the multi-head attention as the feed-forward layer (or part of the feed-forward layer) to capture local temporal correlations.

  • Recurrent Embedding: Temporal Fusion Transformer (TFT) [14] and the work in [3] use a recurrent layer to encode content-based order dependencies into the input sequence.

  • Recurrent Attention: Recurrent Memory Transformer [2], Block Recurrent Transformer [11], and R-Transformer [25] use a recurrent neural network to calculate the attention matrix, harnessing temporal information more effectively than point-wise attention.

  • Recurrent Feed-forward: Instead of point-wise feed-forward, TRANS-BLSTM [10] uses a recurrent layer after multi-head attention to harness non-linear temporal dependencies.

3 Problem Definition

A multivariate time series sequence can be described as \(X=\{ x_1, x_2, \ldots, x_T\} \), where \(x_i \in \mathbb {R}^N\), \(i \in \{1,2,\cdots ,T\}\), T is the maximum length of the sequence, and N is the number of variables. A dataset contains multiple (sequence, label) pairs and is denoted by \(D=\{ \left( X_1, y_1\right) ,\left( X_2, y_2\right) , \ldots ,\left( X_n, y_n\right) \} \), where each \(y_k\) denotes a label, \(k\in \{1,2,\cdots ,n\}\). The objective of multivariate time series classification is to train a classifier to map the input sequences to probability distributions over the classes for the dataset D.

4 Methodology

We call TST [32] the basic model. Rather than exhaustively comparing all related studies, we design six Transformer-based variants, one for each of the six types of techniques incorporated in the related studies (shown in Fig. 1). We further identify three configurable components of a Transformer architecture (shown in Fig. 2): the input embedding layer (which projects the input time series into the latent space), the projection layer (which calculates the attention matrix), and the feed-forward layer (which leverages non-linear relationships). For each variant, we apply a different technique (e.g., a linear layer, a convolutional layer, or a Gated Recurrent Unit) to one of these components, as detailed in Table 1 (a minimal code sketch follows the table).

Fig. 2. A general architecture of Transformer-based variants for modeling sequential data. The corresponding relations between configurable components and the respective candidate techniques are indicated by dashed lines.

Table 1. Configurations for the basic model and six variants. ConvEmbedding means Convolutional Embedding Variant; RecEmbedding means Recurrent Embedding Variant; the same naming rule applies to other models.
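To make these configurations concrete, below is a minimal, hypothetical PyTorch helper that instantiates one candidate technique per configurable component; the function name, the dimensions in the usage comment, and the wrapper-free treatment of tensor layouts are illustrative assumptions, not our actual implementation.

```python
import torch.nn as nn

def make_component(kind: str, in_dim: int, out_dim: int) -> nn.Module:
    """Return the layer used for one configurable component (hypothetical helper)."""
    if kind == "linear":
        return nn.Linear(in_dim, out_dim)
    if kind == "conv":
        # Kernel size 3 with padding 1, as used by the convolutional variants (Sect. 4.2).
        return nn.Conv1d(in_dim, out_dim, kernel_size=3, padding=1)
    if kind == "gru":
        # GRU, as used by the recurrent variants (Sect. 4.3).
        return nn.GRU(in_dim, out_dim, batch_first=True)
    raise ValueError(f"unknown component type: {kind}")

# Example: the ConvEmbedding variant swaps only the input embedding layer.
# embedding    = make_component("conv",   M,   d_k)   # input embedding layer
# projection_q = make_component("linear", d_k, d_k)   # projection layer (Q; K, V analogous)
# feed_forward = make_component("linear", d_k, d_k)   # feed-forward layer
# Note: Linear, Conv1d, and GRU expect different tensor layouts and return types,
# so a thin wrapper is needed in practice.
```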

4.1 Basic Model

The basic model adopts linear layers in all three components. In this case, each sample is represented as \(\textbf{X} = [\textbf{x}_1, \textbf{x}_2, \ldots, \textbf{x}_T] \in \mathbb {R}^{M\times T}\) with \(\textbf{x}_t \in \mathbb {R}^M\), where T is the sequence length and M is the number of variables. The input embedding can be described as:

$$\begin{aligned} U_t=W^x x_t+b^x \end{aligned}$$
(2)

where \(t=1, 2, \ldots, T\) is the time stamp index, and \(W^x\in \mathbb {R}^{M \times d_k}\) and \(b^x\in \mathbb {R}^{d_k}\) are learnable parameters. The projection layer can be described as:

$$\begin{aligned} \begin{aligned} &Q=W^Q U_t+b^Q \\ &K=W^K U_t+b^K \\ &V=W^V U_t+b^V \end{aligned} \end{aligned}$$
(3)

where \(W^Q\in \mathbb {R}^{d_k \times d_k}\), \(W^K\in \mathbb {R}^{d_k \times d_k}\), \(W^V\in \mathbb {R}^{d_k \times d_k}\), \(b^Q\in \mathbb {R}^{d_k}\), \(b^K\in \mathbb {R}^{d_k}\), and \(b^V\in \mathbb {R}^{d_k}\) are learnable parameters. We use the standard scaled dot-product attention proposed in the vanilla Transformer [23] for the self-attention calculation:

$$\begin{aligned} {\text {Attention}}(Q, K, V)={\text {softmax}}\left( \frac{Q K^T}{\sqrt{d_k}}\right) V. \end{aligned}$$
(4)

The feed-forward layer can be described as:

$$\begin{aligned} F F N(x)={\text {ReLU}}\left( W_1 x+b_1\right) W_2+b_2 \end{aligned}$$
(5)

where \(W_1\in \mathbb {R}^{d_k \times d_k}\), \(W_2\in \mathbb {R}^{d_k \times d_k}\), \(b_1\in \mathbb {R}^{d_k}\), and \(b_2\in \mathbb {R}^{d_k}\) are all learnable parameters.
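Putting Eqs. (2)–(5) together, the following is a minimal single-head sketch of the basic model's three components; for readability it omits the multi-head splitting, residual connections, and normalization used in practice, so it should be read as an illustration rather than the full TST architecture.

```python
import math
import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Sketch of the basic model's three linear components (Eqs. 2-5)."""
    def __init__(self, n_vars: int, d_k: int):
        super().__init__()
        self.embed = nn.Linear(n_vars, d_k)              # Eq. (2): input embedding
        self.w_q = nn.Linear(d_k, d_k)                   # Eq. (3): projections
        self.w_k = nn.Linear(d_k, d_k)
        self.w_v = nn.Linear(d_k, d_k)
        self.ffn = nn.Sequential(                        # Eq. (5): feed-forward
            nn.Linear(d_k, d_k), nn.ReLU(), nn.Linear(d_k, d_k)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, n_vars)
        u = self.embed(x)                                # (batch, T, d_k)
        q, k, v = self.w_q(u), self.w_k(u), self.w_v(u)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(q.size(-1)), dim=-1)
        z = attn @ v                                     # Eq. (4): scaled dot-product attention
        return self.ffn(z)
```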

4.2 Convolutional-Based Variants

We refer to the architectures that employ convolutional layers in any of the three components (input embedding layer, projection layer, or feed-forward layer) as convolutional-based variants. Here, we utilize a one-dimensional convolutional layer with a kernel size of 3. We also set the padding to 1 to preserve the lengths of representations. In the following, we illustrate our convolutional-based variants one by one.

Convolutional Embedding Variant replaces the linear layer with a convolutional layer in the input embedding layer, which is formulated below:

$$\begin{aligned} U_t=W^x * x_t+b^x \end{aligned}$$
(6)

where \(*\) is the convolution operation, \(W^x\in \mathbb {R}^{M \times d_k \times P}\) and \(b^x\in \mathbb {R}^{d_k}\) are learnable parameters, and P is the kernel size.
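A minimal sketch of Eq. (6), assuming the kernel size of 3 and padding of 1 stated above; the class name and tensor layout are illustrative.

```python
import torch
import torch.nn as nn

class ConvEmbedding(nn.Module):
    """1-D convolution mapping the M input variables to the d_k latent space (Eq. 6)."""
    def __init__(self, n_vars: int, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(n_vars, d_k, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, n_vars); Conv1d expects (batch, channels, T)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)   # (batch, T, d_k)
```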

Convolutional Attention Variant replaces the linear layer with a convolutional layer in the projection layer, which is formulated below:

$$\begin{aligned} \begin{aligned} &Q=W^Q * U_t+b^Q \\ &K=W^K * U_t+b^K \\ &V=W^V * U_t+b^V \end{aligned} \end{aligned}$$
(7)

where \(W^Q\in \mathbb {R}^{d_k \times d_k \times P}\), \(W^K\in \mathbb {R}^{d_k \times d_k \times P}\), \(W^V\in \mathbb {R}^{d_k \times d_k \times P}\), \(b^Q\in \mathbb {R}^{d_k}\), \(b^K\in \mathbb {R}^{d_k}\), and \(b^V\in \mathbb {R}^{d_k}\) are learnable parameters.
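A corresponding sketch of Eq. (7), in which each of Q, K, and V is produced by its own 1-D convolution over the embedded sequence; names and tensor layout are again illustrative.

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Convolutional Q/K/V projections (Eq. 7)."""
    def __init__(self, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.q_conv = nn.Conv1d(d_k, d_k, kernel_size, padding=kernel_size // 2)
        self.k_conv = nn.Conv1d(d_k, d_k, kernel_size, padding=kernel_size // 2)
        self.v_conv = nn.Conv1d(d_k, d_k, kernel_size, padding=kernel_size // 2)

    def forward(self, u: torch.Tensor):
        # u: (batch, T, d_k)
        u = u.transpose(1, 2)                              # (batch, d_k, T)
        q, k, v = self.q_conv(u), self.k_conv(u), self.v_conv(u)
        return tuple(t.transpose(1, 2) for t in (q, k, v)) # each (batch, T, d_k)
```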

Convolutional Feed-forward Variant replaces the linear layers with convolutional layers in the feed-forward layer, which is described below:

$$\begin{aligned} F F N(x)={\text {ReLU}}\left( W_1 * x+b_1\right) * W_2+b_2 \end{aligned}$$
(8)

where \(W_1\in \mathbb {R}^{d_k \times d_k \times P}\), \(W_2\in \mathbb {R}^{d_k \times d_k \times P}\), \(b_1\in \mathbb {R}^{d_k}\), and \(b_2\in \mathbb {R}^{d_k}\) are the learnable parameters.
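A minimal sketch of Eq. (8), using two 1-D convolutions separated by a ReLU; as before, the class name is ours.

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Feed-forward layer built from two 1-D convolutions (Eq. 8)."""
    def __init__(self, d_k: int, kernel_size: int = 3):
        super().__init__()
        self.conv1 = nn.Conv1d(d_k, d_k, kernel_size, padding=kernel_size // 2)
        self.conv2 = nn.Conv1d(d_k, d_k, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, d_k)
        x = x.transpose(1, 2)
        x = self.conv2(torch.relu(self.conv1(x)))
        return x.transpose(1, 2)
```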

4.3 Recurrent-Based Variants

We name the architectures that use recurrent layers in any of the three components (input embedding layer, projection layer, or feed-forward layer) recurrent-based variants. Here, we use the Gated Recurrent Unit (GRU) [5] as the recurrent layer. In the following, we illustrate our recurrent-based variants one by one.

Recurrent Embedding Variant replaces the linear layer with the GRU in the input embedding layer, which is formulated below:

$$\begin{aligned} \begin{aligned} r_t &=\sigma \left( W_{i r}^x x_t+b_{i r}^x+W_{h r}^x h_{(t-1)}+b_{h r}^x\right) \\ z_t &=\sigma \left( W_{i z}^x x_t+b_{i z}^x+W_{h z}^x h_{(t-1)}+b_{h z}^x\right) \\ n_t &=\tanh \left( W_{i n}^x x_t+b_{i n}^x+r_t \circ \left( W_{h n}^x h_{(t-1)}+b_{h n}^x\right) \right) \\ h_t &=\left( 1-z_t\right) \circ n_t+z_t \circ h_{(t-1)} \\ U_t &=Concat(h_1, h_2, ..., h_T) \end{aligned} \end{aligned}$$
(9)

where \(W_{i r}^x\in \mathbb {R}^{M \times d_k}\), \(W_{i z}^x\in \mathbb {R}^{M \times d_k }\), \(W_{i n}^x\in \mathbb {R}^{M \times d_k}\), \(W_{h r}^x\in \mathbb {R}^{d_k \times d_k}\), \(W_{h z}^x\in \mathbb {R}^{d_k \times d_k}\), \(W_{h n}^x\in \mathbb {R}^{d_k \times d_k}\), \(b_{i r}^x\in \mathbb {R}^{d_k}\), \(b_{h r}^x\in \mathbb {R}^{d_k}\), \(b_{i z}^x\in \mathbb {R}^{d_k}\), \(b_{h z}^x\in \mathbb {R}^{d_k}\), \(b_{i n}^x\in \mathbb {R}^{d_k}\), and \(b_{h n}^x\in \mathbb {R}^{d_k}\) are learnable parameters, and \(\circ \) denotes the Hadamard product.
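A minimal sketch of Eq. (9): PyTorch's nn.GRU implements the gate equations internally, and its per-step hidden states already form the concatenation \(U_t\).

```python
import torch
import torch.nn as nn

class RecurrentEmbedding(nn.Module):
    """GRU-based input embedding (Eq. 9)."""
    def __init__(self, n_vars: int, d_k: int):
        super().__init__()
        self.gru = nn.GRU(input_size=n_vars, hidden_size=d_k, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, n_vars); the GRU output stacks h_1, ..., h_T along time.
        u, _ = self.gru(x)                 # (batch, T, d_k)
        return u
```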

Recurrent Attention Variant replaces the linear layer with the GRU in the projection layer. Since the calculation processes of all the matrices are similar, for simplicity, we only present the calculation process of the query matrix Q in the projection layer below:

$$\begin{aligned} \begin{aligned} r_t &=\sigma \left( W_{i r}^Q U_t+b_{i r}^Q+W_{h r}^Q h_{(t-1)}+b_{h r}^Q\right) \\ z_t &=\sigma \left( W_{i z}^Q U_t+b_{i z}^Q+W_{h z}^Q h_{(t-1)}+b_{h z}^Q\right) \\ n_t &=\tanh \left( W_{i n}^Q U_t+b_{i n}^Q+r_t \circ \left( W_{h n}^Q h_{(t-1)}+b_{h n}^Q\right) \right) \\ h_t &=\left( 1-z_t\right) \circ n_t+z_t \circ h_{(t-1)} \\ Q &=Concat(h_1, h_2, ..., h_T) \end{aligned} \end{aligned}$$
(10)

where \(W_{i r}^Q\in \mathbb {R}^{d_k \times d_k}\), \(W_{i z}^Q\in \mathbb {R}^{d_k \times d_k }\), \(W_{i n}^Q\in \mathbb {R}^{d_k \times d_k}\), \(W_{h r}^Q\in \mathbb {R}^{d_k \times d_k}\), \(W_{h z}^Q\in \mathbb {R}^{d_k \times d_k}\), \(W_{h n}^Q\in \mathbb {R}^{d_k \times d_k}\), \(b_{i r}^Q\in \mathbb {R}^{d_k}\), \(b_{h r}^Q\in \mathbb {R}^{d_k}\), \(b_{i z}^Q\in \mathbb {R}^{d_k}\), \(b_{h z}^Q\in \mathbb {R}^{d_k}\), \(b_{i n}^Q\in \mathbb {R}^{d_k}\), and \(b_{h n}^Q\in \mathbb {R}^{d_k}\) are learnable parameters.
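A corresponding sketch of Eq. (10), using one GRU per attention component; whether the three GRUs share parameters is an implementation choice left as an assumption here.

```python
import torch
import torch.nn as nn

class RecurrentProjection(nn.Module):
    """GRU-based Q/K/V projections (Eq. 10; only Q is written out in the text)."""
    def __init__(self, d_k: int):
        super().__init__()
        self.q_gru = nn.GRU(d_k, d_k, batch_first=True)
        self.k_gru = nn.GRU(d_k, d_k, batch_first=True)
        self.v_gru = nn.GRU(d_k, d_k, batch_first=True)

    def forward(self, u: torch.Tensor):
        # u: (batch, T, d_k)
        q, _ = self.q_gru(u)
        k, _ = self.k_gru(u)
        v, _ = self.v_gru(u)
        return q, k, v
```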

Recurrent Feed-forward Variant replaces the linear layer with the GRU in the feed-forward layer, which is formulated below:

$$\begin{aligned} \begin{aligned} r_t &=\sigma \left( W_{i r} U_t+b_{i r}+W_{h r} h_{(t-1)}+b_{h r}\right) \\ z_t &=\sigma \left( W_{i z} U_t+b_{i z}+W_{h z} h_{(t-1)}+b_{h z}\right) \\ n_t &=\tanh \left( W_{i n} U_t+b_{i n}+r_t \circ \left( W_{h n} h_{(t-1)}+b_{h n}\right) \right) \\ h_t &=\left( 1-z_t\right) \circ n_t+z_t \circ h_{(t-1)} \\ O &=Concat(h_1, h_2, ..., h_T) \end{aligned} \end{aligned}$$
(11)

where \(W_{i r}\in \mathbb {R}^{d_k \times d_k}\), \(W_{i z}\in \mathbb {R}^{d_k \times d_k }\), \(W_{i n}\in \mathbb {R}^{d_k \times d_k}\), \(W_{h r}\in \mathbb {R}^{d_k \times d_k}\), \(W_{h z}\in \mathbb {R}^{d_k \times d_k}\), \(W_{h n}\in \mathbb {R}^{d_k \times d_k}\), \(b_{i r}\in \mathbb {R}^{d_k}\), \(b_{h r}\in \mathbb {R}^{d_k}\), \(b_{i z}\in \mathbb {R}^{d_k}\), \(b_{h z}\in \mathbb {R}^{d_k}\), \(b_{i n}\in \mathbb {R}^{d_k}\), and \(b_{h n}\in \mathbb {R}^{d_k}\) are learnable parameters, and O is the final output of the feed-forward layer.
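A minimal sketch of Eq. (11); the feed-forward GRU maps \(d_k\)-dimensional attention outputs to \(d_k\)-dimensional outputs, and the value \(d_k = 64\) below is only an example.

```python
import torch.nn as nn

# A single GRU replaces the point-wise feed-forward layer; its stacked hidden
# states form the output O of Eq. (11).
recurrent_ffn = nn.GRU(input_size=64, hidden_size=64, batch_first=True)  # d_k = 64 assumed
# out, _ = recurrent_ffn(attention_output)   # attention_output: (batch, T, d_k)
```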

5 Experiments

We empirically evaluate the impact of positional embedding on the performance of the basic model and transformer-based variants (illustrated in Sect. 4) for multivariate time series classification. We report our experimental configurations and discuss the results in the following subsections.

5.1 Datasets

We selected 30 public multivariate time series datasets from the UEA Time Series Classification Repository [7]. All datasets were pre-split into training and test sets. We normalized all datasets to zero mean and unit standard deviation and applied zero padding to ensure all the sequences in each dataset have the same length.
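A hedged sketch of this preprocessing; whether the normalization statistics are computed per variable over the whole dataset (as assumed below) or per sample is an assumption of the illustration.

```python
import numpy as np

def preprocess(sequences):
    """Z-normalize per variable and zero-pad to the dataset's maximum length T.

    sequences: list of arrays, each of shape (length_i, n_vars).
    Returns an array of shape (n_samples, T, n_vars).
    """
    stacked = np.concatenate(sequences, axis=0)
    mean, std = stacked.mean(axis=0), stacked.std(axis=0) + 1e-8
    max_len = max(s.shape[0] for s in sequences)
    out = np.zeros((len(sequences), max_len, stacked.shape[1]), dtype=np.float32)
    for i, s in enumerate(sequences):
        out[i, : s.shape[0]] = (s - mean) / std   # zero padding fills the remainder
    return out
```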

5.2 Model Configuration and Evaluation Metrics

We trained the basic model and the six variants for 500 epochs using the Adam optimizer [12] on all the datasets, both with and without the learnable positional embedding. In addition, we applied an adaptive learning rate, reduced by a factor of 10 every 100 epochs, and employed dropout regularization to prevent overfitting. Table 2 summarizes our model configurations for each dataset.
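A hedged sketch of this training schedule; the initial learning rate and the absence of dropout in the snippet are simplifications, not our exact configuration.

```python
import torch

def train(model, train_loader, epochs: int = 500, lr: float = 1e-3):
    """Adam with the learning rate reduced by a factor of 10 every 100 epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.1)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x_batch, y_batch in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x_batch), y_batch)
            loss.backward()
            optimizer.step()
        scheduler.step()
```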

We evaluate the models using two metrics: accuracy and macro F1-Score. To mitigate the effect of randomized parameter initialization, we repeated the training and test procedures five times and took the average as the final results.
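A minimal sketch of the two evaluation metrics using scikit-learn; the averaging over the five repeated runs is shown as a commented usage example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Accuracy and macro-averaged F1-Score."""
    return accuracy_score(y_true, y_pred), f1_score(y_true, y_pred, average="macro")

# Averaging over five repeated runs:
# accs, f1s = zip(*[evaluate(y_true, preds[i]) for i in range(5)])
# final_acc, final_f1 = np.mean(accs), np.mean(f1s)
```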

Table 2. Model configurations.
Table 3. Accuracy of different models on 30 benchmark datasets.
Table 4. Macro F1-Score of different models on 30 benchmark datasets.

5.3 Results and Analysis

Table 3 and Table 4 show the methods’ performance on the 30 datasets. The results show positional embedding positively impacts the basic model—with positional embedding, the basic model’s performance improves by 17.5% and 14.3% in accuracy and macro F1-Score, respectively. This reveals the significance of enabling the basic model to leverage the position information (e.g., via positional embedding) in solving the multivariate time series classification problem.

In contrast, positional embedding negatively impacts the performance of the Transformer-based variants. Without positional embedding, the convolutional embedding (i.e., ConvEmbedding in Table 1) and recurrent embedding (i.e., RecEmbedding in Table 1) models outperformed all other variants, achieving the best accuracies of 56.21% and 56.17% and the best macro F1-Scores of 0.528 and 0.5375, respectively. These two models differ from all other models in that their input embedding layers encode the position information when projecting the raw data into a latent space, making the position information accessible to subsequent layers for feature extraction and resulting in superior performance. Incorporating positional embedding decreased the average accuracy of the variants by 12.7% (convolutional embedding), 9.1% (convolutional attention), 18.6% (convolutional feed-forward), 22.1% (recurrent embedding), 21.5% (recurrent attention), and 15.7% (recurrent feed-forward). Results for the macro F1-Score show similar trends.

Since the convolutional and recurrent layers can inherently capture the position information from sequential data, it is natural to consider positional embedding redundant for Transformer-based variants. Besides, positional embedding risks introducing inductive bias and contaminating the original data. Specifically, positional embedding injects the same information into sequences of different classes, bringing new challenges to the classifiers; this may also contribute to performance degradation.

Further reflecting on the results, we suggest that positional embedding may not be necessary for Transformer-based variants that already contain position-sensitive modules. In particular, in time series classification the classifier focuses on the differences between sequences of different classes, whereas positional embedding is content-irrelevant, adding the same position information to all sequences regardless of their class labels. As position-sensitive modules generally consider content information when encoding position information, the redundant, content-irrelevant positional embedding may lead the model to capture spurious correlations that hinder the classifier's performance.

6 Conclusion

Existing Transformer-based architectures generally contain position-sensitive layers yet routinely incorporate positional embedding without a comprehensive evaluation of its effectiveness on multivariate time series classification. In this paper, we investigate the impact of positional embedding on the vanilla Transformer architecture and six types of Transformer-based variants in multivariate time series classification. Our experimental results on 30 public time series datasets show that positional embedding improves the performance of the vanilla Transformer while adversely impacting the performance of the Transformer-based variants on classification tasks. Our findings refute the necessity of incorporating positional embedding in Transformer-based architectures that already contain position-sensitive layers, such as convolutional or recurrent layers. We also advocate applying position-sensitive layers directly to the input when a Transformer-based architecture uses such layers, to gain better results in multivariate time series classification.