1 Introduction

Time series forecasting is indispensable across various sectors, shaping crucial decision-making processes. Traditional methods, such as ARIMA [4, 5], rely on predefined models to capture trends and cycles in historical series. While they are effective for stationary series, their fixed structure and inability to capture dynamic dependencies among multiple features often lead to poor performance on real-world data.

Deep neural networks with recurrent architectures, i.e., RNNs [12, 15], and memory cells, i.e., LSTMs [7, 10], can capture dynamic temporal dependencies by encoding significant information into latent vectors extracted from historical series. However, RNNs with a one-step recurrent connection struggle with long-term dependencies due to vanishing and exploding gradients. While LSTMs with gates and memory cells can capture longer-term dependencies, their effectiveness is still constrained by the finite capacity of the memory cells and the complexity of their computational processes.

Transformers [11, 14, 16, 18] introduce an attention mechanism, e.g., self-attention, which allows the model to directly encode important information into the sequence vectors themselves based on relationships, e.g., co-occurrences, within the sequence. This mechanism enables models to process sequences in parallel and provides an enhanced capacity to capture complex and long-term dependencies. However, despite their success in natural language processing, the inherent permutation invariance of attention mechanisms poses significant challenges for time series data; Transformers therefore rely on a heuristic extension, called positional encoding, to add information about the order of time steps.

In light of this, recent studies have suggested the potential of linear models, which are inherently sensitive to the order of the inputs, for sequential data [17]. Among linear models, the TSMixer (Time-Series Mixer) [6] model employs repeated MLPs (multi-layer perceptrons) to mix time and feature information alternately, encoding useful information into sequence vectors from complex multivariate time series data. However, TSMixer’s feature-mixing approach uses a common MLP across all time steps, leading to time-invariant, non-adaptive feature mixing that hinders the accurate extraction of historical information.

To address this issue, we propose enhancements to the Feature Mixing component of TSMixer. Firstly, we propose a Frequency-Aware Mixer (FAM), which adds an adaptive frequency component of each feature to the feature-mixing, enabling the adjustment of the strength of feature mixing based on the adaptive time cycles. Secondly, we propose an Event Proximity-Aware Mixer (EPAM), which adds the proximity to the principal observation (event) as an additional component of the feature mixing, enabling the strength of the feature mixing to be adjusted based on the relation to the representative events. These enhancements will enable the model to more accurately grasp the complex interrelations among features.

The main contributions of this paper are summarized as follows:

  1. We propose to enhance the Feature Mixing component of TSMixer, a state-of-the-art multivariate time-series forecasting method, to allow for time-dependent and adaptive mixing by introducing principal frequency components, called the Frequency-Aware Mixer (FAM), and the distance to principal time steps, called the Event Proximity-Aware Mixer (EPAM), as additional information vectors.

  2. We demonstrate the effectiveness of the proposed method over existing state-of-the-art multivariate time-series forecasting methods through extensive comparative experiments on various real-world datasets.

After this introductory section, the rest of this paper is organized as follows. Section 2 formulates the problem and reviews related works. Section 3 details the proposed method. Section 4 describes the experimental evaluation and discussion, and Sect. 5 presents the conclusion.

2 Formulation and Related Works

This section formulates the problem of multivariate time series forecasting and reviews its related works.

2.1 Formulation

Fig. 1. Illustration of the time series forecasting process based on the formulation, using weather-related multivariate data. \(f_{\boldsymbol{\theta }}(\cdot )\) transforms the \(L\)-step history of observations \(X^{-t}\) to generate the \(T\)-step future observations \(\widehat{X^{t+}}\), which are then evaluated against the ground truth \(X^{t+}\) using the loss function \(\mathcal {L}(\cdot ,\cdot )\) to compute the discrepancy. Multivariate data are adapted from [8].

Let \(X_{tc}\) denote the \(c\)-th observation at time \(t\), and let \(:\) denote all elements along the corresponding axis. Let \(X^{-t}\in \mathbb {R}^{L\times C}\) be the \(L\)-step history of observations, where \(X^{-t}_{t:} \in \mathbb {R}^{1 \times C}\) is the vector of \(C\) observations at time step \(t\) and \(X^{-t}_{:c} \in \mathbb {R}^{L\times 1}\) is the vector of the \(c\)-th observation over the \(L\) steps. Meanwhile, let \(X^{t+}\in \mathbb {R}^{T\times C}\) be the \(T\)-step future observations starting from the step after \(X^{-t}\), as follows:

$$\begin{aligned} X^{-t}&= \big [X_{t-L+1:},\ldots ,X_{t-1:},X_{t:}\big ],\nonumber \\ X^{t+}&= \big [X_{t+1:},X_{t+2:},\ldots ,X_{t+T:}\big ]. \end{aligned}$$
(1)

The task of multivariate time-series forecasting is to obtain a model \( f(\cdot ) \) to predict future observations \( X^{t+}\) given its history \( X^{-t}\) as follows:

$$\begin{aligned} \widehat{X^{t+}} = f_{\boldsymbol{\theta }}(X^{-t}), \end{aligned}$$
(2)

where the parameter \(\boldsymbol{\theta }\) of the model is tuned to minimize the loss function \(\mathcal {L}(\cdot ,\cdot )\) averaged over the training data \(\mathcal {D}_{\textrm{tr}}\) as follows:

$$\begin{aligned} \min _{\boldsymbol{\theta }} \frac{1}{|\mathcal {D}_{\textrm{tr}}|} \sum _{t=L}^{|\mathcal {D}_{\textrm{tr}}|}\mathcal {L}\big (X^{t+}, f_{\boldsymbol{\theta }}(X^{-t}) \big ), \end{aligned}$$
(3)

where \(\mathcal {D}_{\textrm{tr}}\) is defined as follows:

$$\begin{aligned} \mathcal {D}_{\textrm{tr}}\equiv \Big \{ \big (X^{-t}, X^{t+}\big )\Big \}_{t=L}^{N_{\textrm{tr}}-T}, \end{aligned}$$
(4)

where \(N_{\textrm{tr}}\) is the number of steps in the training sequence. Similarly, the validation and test data are defined as follows:

$$\begin{aligned} \mathcal {D}_{\textrm{val}}&\equiv \Big \{ \big (X^{-t}, X^{t+}\big )\Big \}_{t=N_{\textrm{tr}}}^{N_{\textrm{tr}}+N_{\textrm{val}}-T},\nonumber \\ \mathcal {D}_{\textrm{te}}&\equiv \Big \{ \big (X^{-t}, X^{t+}\big )\Big \}_{t=N_{\textrm{tr}}+N_{\textrm{val}}}^{N_{\textrm{tr}}+N_{\textrm{val}}+N_{\textrm{te}}-T}, \end{aligned}$$
(5)

where \(N_{\textrm{val}}\) and \(N_{\textrm{te}}\) are the numbers of steps in the validation and test sequences, respectively; there is no overlap between the training, validation, and test sequences. Fig. 1 illustrates an overview of the formulation.
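
As a concrete illustration of this formulation, the following minimal numpy sketch builds the windowed pairs of Eqs. (1), (4), and (5) from a raw length-\(N\) sequence; the function name and the assumption that the series is stored as an \((N, C)\) array are ours, not part of the original formulation.

```python
import numpy as np

def make_window_pairs(series: np.ndarray, L: int, T: int):
    """Build (X^{-t}, X^{t+}) pairs from an (N, C) series, following Eqs. (1), (4), (5).

    Every index t with a full L-step history and a full T-step future yields one pair.
    """
    N = series.shape[0]
    pairs = []
    for t in range(L - 1, N - T):            # t indexes the last step of the history window
        hist = series[t - L + 1:t + 1]       # X^{-t}, shape (L, C)
        fut = series[t + 1:t + 1 + T]        # X^{t+}, shape (T, C)
        pairs.append((hist, fut))
    return pairs

# Illustrative usage: split the raw sequence into non-overlapping training,
# validation, and test subsequences first, then window each subsequence
# separately so that no pair mixes steps from different splits.
```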

2.2 Attention Mechanisms in Time Series Forecasting

Attention mechanisms have been applied to time series forecasting [11, 14, 16, 18], enabling models to dynamically encode important information into the sequence vectors \(X^{-t}\) based on the relationships between observation vectors at different time steps. More specifically, in the attention mechanism, affinity weight \(W^\textrm{att}\in \mathbb {R}^{L\times L}\) is computed based on the similarity between query vectors \(Q\in \mathbb {R}^{L\times C}\) and key vectors \(K\in \mathbb {R}^{L\times C}\). Next, query \(Q\) is transformed through an interpolation of value vectors \(V\in \mathbb {R}^{L\times C}\), as follows:

$$\begin{aligned} W^\textrm{att}= \textrm{softmax}\Big ( \frac{(QW^Q)\, (KW^K)^\top }{\sqrt{C}} \Big ),\nonumber \\ Q' = \text {Attention}(Q,K,V)= W^\textrm{att}(V\, W^V), \end{aligned}$$
(6)

where \( W^Q, W^K\), and \( W^V\in \mathbb {R}^{C\times C} \) are trainable linear projection matrices. This weight \(W^\textrm{att}\) captures the dependencies across the time series, making attention mechanisms particularly useful for identifying intricate temporal relationships.
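
For reference, Eq. (6) can be written out in a few lines of numpy; this is a generic single-head sketch of scaled dot-product attention rather than the exact implementation of any cited model.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, W_q, W_k, W_v):
    """Single-head scaled dot-product attention following Eq. (6).

    Q, K, V: (L, C) sequence matrices; W_q, W_k, W_v: (C, C) projection matrices.
    Returns the transformed queries Q' = W_att (V W_v).
    """
    C = Q.shape[1]
    W_att = softmax((Q @ W_q) @ (K @ W_k).T / np.sqrt(C))  # (L, L) affinity weights
    return W_att @ (V @ W_v)
```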

There are several extensions that overcome the limitations of the Transformer for multivariate time-series forecasting, such as Autoformer [16], Informer [18], and PatchTST [11]. Autoformer incorporates an autocorrelation mechanism to capture long-term dependencies and a deep decomposition architecture that sequentially decomposes the time series into trend, seasonal, and random components during forecasting. Informer introduces ProbSparse self-attention, which probabilistically selects important queries with higher attention scores to reduce computational complexity, enabling the efficient capture of long-term dependencies. PatchTST divides the time series into patches treated as time steps in the attention mechanism and applies a single embedding and attention mechanism to each univariate series individually, enabling efficient multivariate time-series forecasting.

However, due to the attention mechanisms’ inherent permutation invariance property, Transformer models face challenges when directly applied to time series data, where the order of time steps critically impacts forecasting accuracy. Currently, Transformer models attempt to address this issue through positional encoding, which aims to inject sequence information into the model. Yet, there is a concern that the effect of positional encoding may diminish as the attention mechanism is applied repeatedly.

2.3 TSMixer: An All-MLP Architecture for Time Series Forecasting

Fig. 2. Architecture of TSMixer, consisting of \(N\) repeated mixer layers and a temporal projection. Note that the time- and feature-mixing MLPs in each mixer layer are shared across all features and all time steps.

Recent research [17] has highlighted that simple linear models can be highly effective for time series forecasting, surpassing Transformer-based models such as Autoformer [16], Informer [18], and FEDformer [19]. As illustrated in Fig. 2, TSMixer applies MLPs alternately in the time and feature domains. TSMixer consists of three main components, Time-Mixer, Feature-Mixer, and Temporal Projection, as follows:

Time-Mixer encodes temporal information, e.g., long-term dependencies, into the history \(X^{-t}\) by blending across the time direction \(X^{-t}_{:c}\) as follows:

$$\begin{aligned} \text {TM}(X^{-t}_{:c}) &= \text {Drop}\Big (\sigma \big (({X^{-t}_{:c}})^\top W_\text {TM}+ \boldsymbol{b}_\text {TM}\big )\Big ),\nonumber \\ X^{-t}_{:c} &\leftarrow \text {Norm}\Big (X^{-t}_{:c} + \text {TM}(X^{-t}_{:c})^\top \Big ), \end{aligned}$$
(7)

where \(W_{\text {TM}} \in \mathbb {R}^{L\times L}\) and \(\boldsymbol{b}_{\text {TM}} \in \mathbb {R}^{1\times L}\) are trainable weight and bias, respectively. \( \sigma (\cdot ) \), \( \text {Drop}(\cdot ) \), and \( \text {Norm}(\cdot ) \) represent an activation function, i.e., ReLU, a dropout operation, and a normalization operation, i.e., 2D batch normalization applied over the \(L\times C\) plane along the batch dimension, respectively. Note that the same time-mixing MLPs are shared across all types of features.
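
A minimal numpy sketch of one time-mixing step in Eq. (7) is given below; dropout and batch normalization are omitted to keep the example short, so only the shared linear-over-time transformation, the ReLU, and the residual connection are shown.

```python
import numpy as np

def time_mix(X, W_tm, b_tm):
    """Time-mixing of Eq. (7) applied to a whole history window X of shape (L, C).

    The same (L, L) weight W_tm and (1, L) bias b_tm are shared across all C
    features; dropout and normalization from Eq. (7) are left out of this sketch.
    """
    mixed = np.maximum(X.T @ W_tm + b_tm, 0.0)  # ReLU((X^{-t}_{:c})^T W_TM + b_TM), shape (C, L)
    return X + mixed.T                          # residual connection, shape (L, C)
```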

Feature-Mixer encodes information regarding the relationship among different observations, e.g., co-occurrence of observations, into the history \(X^{-t}\) by blending across the feature direction \(X^{-t}_{t:}\) as follows:

$$\begin{aligned} U_{t:} = \text {Drop}\Big (\sigma \big (X^{-t}_{t:}W_{\text {FM}_1} &+ \boldsymbol{b}_{\text {FM}_1}\big )\Big ),\ \ \text {FM}({X^{-t}_{t:}}) = \text {Drop}\Big (U_{t:}W_{\text {FM}_2} + \boldsymbol{b}_{\text {FM}_2}\Big ),\nonumber \\ {X^{-t}_{t:}} &\leftarrow \text {Norm}\Big (X^{-t}_{t:} + \text {FM}({X^{-t}_{t:}})\Big ), \end{aligned}$$
(8)

where \(W_{\text {FM}_1} \in \mathbb {R}^{C\times H}\) and \(W_{\text {FM}_2} \in \mathbb {R}^{H\times C}\) are trainable weights, and \(\boldsymbol{b}_{\text {FM}_1} \in \mathbb {R}^{1 \times H}\) and \(\boldsymbol{b}_{\text {FM}_2} \in \mathbb {R}^{1 \times C}\) are trainable biases. \(U\in \mathbb {R}^{L\times H}\) represents the hidden variables with \(H\) hidden nodes. Note that the same feature-mixing MLPs are shared across all time steps.
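
Analogously, the two-layer feature-mixing MLP of Eq. (8) can be sketched as follows; dropout and normalization are again left out, and operating on the full \((L, C)\) window at once makes explicit that the same weights are applied to every time step.

```python
import numpy as np

def feature_mix(X, W_fm1, b_fm1, W_fm2, b_fm2):
    """Feature-mixing of Eq. (8) applied to a whole history window X of shape (L, C).

    W_fm1: (C, H), W_fm2: (H, C).  The same MLP is shared across all L time steps,
    which is exactly the time-invariance that the proposed FAM and EPAM relax.
    """
    U = np.maximum(X @ W_fm1 + b_fm1, 0.0)  # hidden variables, shape (L, H)
    FM = U @ W_fm2 + b_fm2                  # back to feature space, shape (L, C)
    return X + FM                           # residual connection
```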

Temporal Projection compresses the \(L\)-step history \(X^{-t}_{:c}\) to the length of the future prediction period, i.e., \(T\) steps, using a fully-connected layer as follows:

$$\begin{aligned} \widehat{X^{t+}_{:c}} = \left( ({X^{-t}_{:c}})^\top W_{\text {TP}} + \boldsymbol{b}_{\text {TP}}\right) ^\top , \end{aligned}$$
(9)

where \(X^{-t}\) represents the output of the mixer layers, and \(W_{\text {TP}} \in \mathbb {R}^{L\times T}\) and \(\boldsymbol{b}_{\text {TP}} \in \mathbb {R}^{1 \times T}\) are the trainable weight and bias, respectively.
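
The temporal projection of Eq. (9) is a single fully-connected layer acting on the time axis of each feature; a corresponding numpy sketch is:

```python
import numpy as np

def temporal_projection(X, W_tp, b_tp):
    """Eq. (9): map the (L, C) mixer output to a (T, C) forecast.

    W_tp: (L, T), b_tp: (1, T); the projection is applied to the time axis of
    every feature independently.
    """
    return (X.T @ W_tp + b_tp).T  # shape (T, C)
```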

The permutation-sensitive properties of the time-mixing MLPs in TSMixer empower the model to effectively capture the dynamic relationships among observations along the time direction, enhancing the prediction performance for multivariate time series data. On the other hand, a limitation exists in the feature-mixing MLPs: the same MLP is used across the time direction, and thus the identical transformation is applied to the vector \(X^{-t}_{t:}\) regardless of its position in the sequence. This permutation-invariant feature mixing prevents the model from capturing significant relationships between observations, e.g., co-occurrences of observations related to trends and seasonal cycles, and potentially degrades the prediction performance.

3 Proposed Method

To allow for time-dependent and adaptive feature mixing, we propose to enhance the feature mixing by introducing principal frequency components, called Frequency-Aware Mixer (FAM), and the distance to the principal time step, called Event Proximity-Aware Mixer (EPAM), as additional information vectors.

3.1 Frequency-Aware Mixer (FAM)

Fig. 3. Architecture of the frequency-aware mixer (FAM), a permutation-sensitive extension of the feature-mixing MLPs in TSMixer (Fig. 2). The matrix \(S^\textrm{fre}\in \mathbb {R}^{L\times K}\) contains \(K\) different waveforms along the time axis, linearly integrated by the weight \(W_{\textrm{fre}}\).

We propose the Frequency-Aware Mixer (FAM), which incorporates an adaptive frequency component into the first layer of the feature-mixing MLPs (in Eq. 8 and Fig. 2), as depicted in Fig. 3 and as follows:

$$\begin{aligned} U_{t:} &= \text {Drop}\Big (\sigma \big (X^{-t}_{t:}W_{\text {FM}_1} + \underline{S^\textrm{fre}_{t:} W_{\textrm{fre}}} + \boldsymbol{b}_{\text {FM}_1}\big )\Big ),\nonumber \\ S^\textrm{fre}_{tk} &= a_k\cos \left( \frac{2\pi p_k}{N_{\textrm{tr}}}t\right) + b_k\sin \left( \frac{2\pi p_k}{N_{\textrm{tr}}}t\right) ,\ \ \ k=1, 2, \ldots , K, \end{aligned}$$
(10)

where the frequency component \(S^\textrm{fre}_{t:} W_{\textrm{fre}}\) is a linear integration of \(K\) different waveforms along the time axis, \(S^\textrm{fre}\in \mathbb {R}^{L\times K}\), using the weight \(W_{\textrm{fre}} \in \mathbb {R}^{K\times H}\). \( a_k\), \( b_k\), and \( p_k\) are trainable parameters tuning the amplitudes of the cosine and sine waves and the frequency of the \(k\)-th waveform, respectively.

To prepare the initial waveforms \(S^\textrm{fre}_{t:}\) representing the training time-series data \(\mathcal {D}_{\textrm{tr}}\), we apply a Fourier transform to the sequence \( [X_{0c}, X_{1c}, \ldots , X_{(N_{\textrm{tr}}-1)c} ] \) of each observation \(c\) and extract \(N_{\textrm{tr}}/2\) waves. Among the waves whose periods do not exceed the history length \(L\), we select the \(m\) waves with the largest power spectra and set their amplitudes and frequencies as the initial values of \(a\), \(b\), and \(p\) for each observation \(c\), giving \(K=mC\) waves in total.
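
The sketch below illustrates how the initial waveform parameters of Eq. (10) could be derived with numpy.fft.rfft as described above, and how the resulting \(S^\textrm{fre}\) matrix is evaluated for the time indices of one history window; the function names, the amplitude convention, and the handling of the period constraint are our assumptions.

```python
import numpy as np

def init_fam_waves(train_series: np.ndarray, L: int, m: int):
    """For each feature, pick the m strongest Fourier waves whose period <= L.

    train_series: (N_tr, C) training sequence.  Returns amplitude vectors a, b
    and frequency indices p, each of length K = m * C, as initial FAM parameters.
    """
    N_tr, C = train_series.shape
    a, b, p = [], [], []
    for c in range(C):
        coeffs = np.fft.rfft(train_series[:, c])      # N_tr/2 + 1 complex coefficients
        k = np.arange(len(coeffs))                    # wave k completes k cycles over N_tr steps
        valid = k >= np.ceil(N_tr / L)                # period N_tr / k must not exceed L
        power = np.abs(coeffs) ** 2 * valid
        top = np.argsort(power)[-m:]                  # m largest power spectra
        a.extend(2.0 * coeffs[top].real / N_tr)       # cosine amplitudes
        b.extend(-2.0 * coeffs[top].imag / N_tr)      # sine amplitudes
        p.extend(k[top])
    return np.array(a), np.array(b), np.array(p, dtype=float)

def s_fre(t_indices: np.ndarray, a, b, p, N_tr: int) -> np.ndarray:
    """Evaluate S^fre of Eq. (10) for the given time indices; shape (L, K)."""
    phase = 2.0 * np.pi * np.outer(t_indices, p) / N_tr
    return a * np.cos(phase) + b * np.sin(phase)
```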

The feature mixing in FAM is sensitive to permutations because the frequency components vary with the time step \(t\). This allows for flexible feature mixing based on the inherent periodic characteristics of time series, e.g., seasonality, trend cycles, and other cyclical patterns. This can enhance the capacity of the model to capture the temporal dynamics of the data and improve the prediction performance.

3.2 Event Proximity-Aware Mixer (EPAM)

Fig. 4. Architecture of the event proximity-aware mixer (EPAM), a temporal characteristics-sensitive extension of the feature-mixing MLPs in TSMixer (Fig. 2). The matrix \(S^\textrm{pro}\in \mathbb {R}^{L\times R}\) contains proximities to \(R\) different representative observations (events) along the time axis, linearly integrated by the weight \(W_{\textrm{pro}}\).

We propose the Event Proximity-Aware Mixer (EPAM), which incorporates an adaptive proximity component into the first layer of the feature-mixing MLPs (in Eq. 8 and Fig. 2), as depicted in Fig. 4 and as follows:

$$\begin{aligned} U_{t:} &= \text {Drop}\Big (\sigma \big (X^{-t}_{t:}W_{\text {FM}_1} + \underline{S^\textrm{pro}_{t:} W_{\textrm{pro}}} + \boldsymbol{b}_{\text {FM}_1}\big )\Big ),\nonumber \\ S^\textrm{pro}_{t:} &= X^{-t}_{t:}X^\textrm{rep}, \end{aligned}$$
(11)

where the proximity component \(S^\textrm{pro}_{t:} W_{\textrm{pro}}\) is a linear integration of proximities to \(R\) different representative observations (events), \(S^\textrm{pro}\in \mathbb {R}^{L\times R}\), using weight \(W_{\textrm{pro}} \in \mathbb {R}^{R\times H}\). \(X^\textrm{rep}\in \mathbb {R}^{C\times R}\) is a set of \(R\) representative observation vectors and \(S^\textrm{pro}_{t:}\) is the inner product (similarity) between an observation vector \(X^{-t}_{t:}\) at time-step \(t\) and representative vectors \(X^\textrm{rep}\).

To prepare the representative vectors \(X^\textrm{rep}\), we apply a clustering method, e.g., k-means, to the \(C\)-dimensional vectors across all training time steps, \( \{ X_{t:} \}_{t=0}^{N_{\textrm{tr}}-1} \), and set the \(R\) cluster centroids as \(X^\textrm{rep}\).
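
A corresponding sketch for EPAM, assuming scikit-learn's KMeans as the clustering method mentioned above; the centroid matrix plays the role of \(X^\textrm{rep}\) in Eq. (11), and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_event_representatives(train_series: np.ndarray, R: int) -> np.ndarray:
    """Cluster all training observation vectors and return X^rep of shape (C, R)."""
    km = KMeans(n_clusters=R, n_init=10, random_state=0).fit(train_series)  # train_series: (N_tr, C)
    return km.cluster_centers_.T                                            # centroids as columns

def s_pro(X_hist: np.ndarray, X_rep: np.ndarray) -> np.ndarray:
    """Proximity component of Eq. (11): inner products between each history step
    X^{-t}_{t:} (rows of the (L, C) window) and the R representative vectors."""
    return X_hist @ X_rep  # shape (L, R)
```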

The feature mixing in EPAM is also sensitive to permutations because the proximity components vary with the time step \(t\). This allows for flexible feature mixing based on the natural variability of time series due to the occurrence of various types of events, e.g., holidays and weather events, potentially enhancing the capacity of the model to capture the fluctuation patterns of the data and improve the prediction performance.

3.3 Entire Architecture and Training

The architecture of the proposed method, i.e., \(f_{\boldsymbol{\theta }}(X^{-t})\), is a variant of TSMixer depicted in Fig. 2 in which the feature-mixer component is replaced with one of our proposed permutation-sensitive feature mixers: FAM (Fig. 3) or EPAM (Fig. 4). We refer to the combinations of TSMixer with our proposed feature mixers as TSMixer + FAM and TSMixer + EPAM, respectively.

For training the entire architecture, we use the mean squared error (MSE) as the loss function \(\mathcal {L}(\cdot ,\cdot )\) in Eq. 3, as follows:

$$\begin{aligned} \mathcal {L}(X^{t+},f_{\boldsymbol{\theta }}(X^{-t})) = \frac{1}{TC} \big \Vert X^{t+}- f_{\boldsymbol{\theta }}(X^{-t}) \big \Vert ^2_\textrm{F}, \end{aligned}$$
(12)

where \(\Vert \cdot \Vert _\textrm{F}\) is the Frobenius norm.

We use early stopping with 5-epoch patience based on the validation loss computed using the validation data described in Table 1 and select the best model with the minimum validation loss.

4 Experimental Evaluation

In this section, we show the effectiveness of the proposed method through experiments on seven popular multivariate long-term forecasting benchmarks such as weather, electricity, and traffic.

4.1 Setting and Comparative Methods

We set the length of history observations as \(L= 512\) following the work [11], and the length of future prediction observations as \(T\in \{96, 192, 336, 720\}\).

We compared the prediction performance with state-of-the-art multivariate time series forecasting methods, both Transformer-based and MLP-mixer-based models, as follows:

  • Transformer-based models: we used the code with default settings provided in the corresponding GitHub repositories.

  • MLP-mixer-based models:

    • TSMixer [6]: we used the basic version of TSMixer provided at https://github.com/google-research/google-research/tree/master/tsmixer with the settings described in the work [6].

    • TMix-Only: we eliminated the feature mixer component (in Fig. 2) from the above TSMixer following the work [6].

    • TSMixer + FAM (proposed method, Sect. 3.1): we set the number of waves for each observation type as \(m=3\) for datasets with fewer observation types, i.e., ETT and Weather, and \(m=1\) for datasets with more types, i.e., Electricity and Traffic. We utilized the numpy.fft.rfft function for the implementation of the Fourier transform. Other settings are the same as for TSMixer. In addition, we applied reversible instance normalization (RevIN) [9] to each input \(X^{-t}\) and output \(\widehat{X^{t+}}\) of the model.

    • TSMixer + EPAM (proposed method, Sect. 3.2): we set the number of representative observations as \(R= 5\). We utilized the sklearn.cluster module for k-means clustering. Other settings are the same as for TSMixer + FAM.

Table 1. Details of the datasets used in the experiments

4.2 Datasets

We used seven real-world multi-variate time series datasets: ETT [13], Weather [3], Electricity [2], and Traffic [1], provided by the work of Autoformer [16] in https://github.com/thuml/Autoformer.

More specifically, the Electricity Transformer Temperature (ETT) datasets contain two-year sequences of loads and oil temperature collected from electricity transformers at 1-hour and 15-minute intervals. The Weather dataset contains one-year sequences of 21 meteorological indicators, such as air temperature and humidity, recorded every 10 min. The Electricity dataset contains three-year sequences of the electricity consumption of 321 customers, collected every hour. Finally, the Traffic dataset contains two-year sequences of road occupancy rates at 862 different places, recorded every hour. Table 1 summarizes the details of the datasets.

As a preprocessing step, we divided the sequences from each dataset into training, validation, and test subsequences, as described in Table 1. We then calculated the mean and standard deviation of each subsequence \( [X_{0c}, X_{1c}, \ldots ] \) for each observation \(c\) and used these values to normalize the corresponding subsequences. We then used these normalized subsequences as the training \(\mathcal {D}_{\textrm{tr}}\) (in Eq. 4), validation \(\mathcal {D}_{\textrm{val}}\), and test \(\mathcal {D}_{\textrm{te}}\) (in Eq. 5) data.
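
The normalization step described above amounts to standardizing each subsequence feature-wise with its own statistics; a small sketch follows, where the epsilon guarding against zero variance is an added assumption.

```python
import numpy as np

def normalize_subsequence(sub: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Standardize an (N, C) subsequence per feature using its own mean and std."""
    mean = sub.mean(axis=0, keepdims=True)
    std = sub.std(axis=0, keepdims=True)
    return (sub - mean) / (std + eps)
```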

4.3 Result

The experimental results are shown in Table 2. The performance is measured using MSE of test data \(\mathcal {D}_{\textrm{te}}\) as follows:

$$\begin{aligned} \textrm{MSE}(\mathcal {D}_{\textrm{te}}) = \frac{1}{TC|\mathcal {D}_{\textrm{te}}|} \sum _{(X^{-t}, X^{t+}) \in \mathcal {D}_{\textrm{te}}}\big \Vert X^{t+}- f_{\boldsymbol{\theta }}(X^{-t}) \big \Vert ^2_\textrm{F}. \end{aligned}$$
(13)

In principle, multivariate models that simultaneously consider the relationships between time and features are expected to offer higher flexibility and performance in time series forecasting compared to univariate models, which only consider the time series of individual features independently. However, Table 2 demonstrates that the performance of Autoformer and Informer among the multivariate models is inferior to that of the univariate model. Furthermore, the performance of TSMixer is equivalent to that of TMix-Only, which lacks a feature mixer, indicating that the co-occurrences in the feature direction provided by a feature mixer are not necessarily important for prediction, as reported in the work [6].

Table 2. Performance comparison in multivariate time series forecasting. The performance is measured using the MSE computed from each test dataset \(\mathcal {D}_{\textrm{te}}\) (in Eq. 13). Performance values surpassing TSMixer are indicated in red among the multivariate models. In addition, the best performance among all models is indicated in bold.

On the other hand, Table 2 shows that the proposed methods, TSMixer+FAM and TSMixer+EPAM, which enhance the feature-mixer with permutation sensitivity, outperform TSMixer in various datasets and future prediction steps, i.e., \(T\). This indicates the potential of the proposed methods for permutation-dependent feature mixing in adaptively modeling relationships between features and reveals the potential importance of feature mixing in multivariate time series forecasting.

Table 3. Comparison of parameter counts and averaged inference time per instance, measured using the test data \(\mathcal {D}_{\textrm{te}}\) in Traffic dataset.

Table 3 presents a comparison of parameter counts and the average time per inference, measured using the test data \(\mathcal {D}_{\textrm{te}}\) of the Traffic dataset, which has the most types of observations, as shown in Table 1. As the table shows, multivariate models tend to have more parameters than univariate models as the number of observations \(C\) increases. Furthermore, in the case of the TSMixer+FAM model, datasets with a large number of observations result in a significantly higher number of extracted frequencies \(K\) (in Eq. 10), leading to a much longer inference time compared to other models. In the future, it will be necessary to adopt strategies such as feature grouping to reduce the number of additional vectors while maintaining predictive accuracy.

4.4 Analysis

Fig. 5. Examples of ground truth (in blue) and predictions (in orange) by the TSMixer, +FAM, and +EPAM models for the T (Temperature), RH (Relative Humidity), WV (Wind Velocity), SWDR (Short Wave Downward Radiation), and CO2 (CO2 concentration) variables in the test data \(\mathcal {D}_{\textrm{te}}\) of the Weather dataset with forecasting length \(T=720\). The history length \(L\) is fixed at 512 for all experiments, and the grey vertical line indicates the boundary date between the history and the future. (Color figure online)

In addition, Fig. 5 depicts examples of forecasts by TSMixer, +FAM, and +EPAM for the T (Temperature), WV (Wind Velocity), rain (precipitation), SWDR (Short Wave Downward Radiation), and CO2 (CO2 concentration) variables of the Weather dataset with forecasting length \(T=720\); the history \(X^{-t}_{:c}\) and ground truth \(X^{t+}_{:c}\) (Eq. 1) are shown in blue and the prediction \(\widehat{X^{t+}}_{:c}\) (in Eq. 2) in orange.

Figure 5 shows that compared to TSMixer, TSMixer+FAM is able to more closely follow the true values with its periodic predictions such as in WV and CO2. This outcome underscores the impact of FAM’s dynamic frequency-dependent feature mixing, which is further enhanced by the incorporation of additional frequency information into the feature mixing process. Therefore, in fields such as multivariate time series forecasting involving frequency components, like temporal traffic patterns, FAM’s ability to accurately track waveforms is considered effective.

Similarly, EPAM is able to more closely follow the true values with its finer-grained predictions. This fine-grained prediction reflects the impact of EPAM’s approach of integrating distance information between major events and each historical point into the feature mixing. Therefore, in multivariate time series forecasting involving rapid changes within short periods, such as Electricity data, EPAM’s ability to make detailed adjustments is considered effective.

From these observations, it is clear that in the domain of multivariate time series data with complex inter-feature relationships, the proposed method can enhance the ability to utilize the relationships between features for more accurate forecasting.

Fig. 6. Examples of ground truth (in blue) and predictions (in orange) for the Oil Temperature (OT) variable in the test data \(\mathcal {D}_{\textrm{te}}\) of the ETTh1 dataset. Each row and column corresponds to a different model and forecast length \(T\in \{96, 192, 336, 720\}\). The history length \(L\) is fixed at 512 for all experiments, and the grey vertical line indicates the boundary date between the history and the future. (Color figure online)

To further analyze the effectiveness of the proposed method, we visualized the experimental results for different forecasting lengths across all comparison methods. Fig. 6 depicts examples of forecasts for the Oil Temperature (OT) variable of the ETTh1 dataset with forecasting lengths \(T\in \{96,192,336,720\}\); the history \(X^{-t}_{:c}\) and ground truth \(X^{t+}_{:c}\) (Eq. 1) are shown in blue and the prediction \(\widehat{X^{t+}}_{:c}\) (in Eq. 2) in orange, where \(c\) corresponds to OT.

In contrast to nonlinear multivariate models, i.e., Autoformer and Informer, which increasingly diverge from the ground truth as the forecast horizon extends, TSMixer-based models are capable of delivering forecasts that are on par with univariate models across various forecast lengths, \(T\). Furthermore, compared to TSMixer, FAM tends to predict with more periodic trends, while EPAM tends to predict with finer amplitudes. This likely occurs because the models have effectively adjusted the mixing features based on the overall frequency of the training data and the proximity to representative events. Consequently, these models not only capture the temporal mixing of information but also extract useful information through feature mixing.

5 Conclusion

In this study, we enhance feature mixing in the TSMixer model for multivariate time series forecasting by introducing principal frequency components through the Frequency-Aware Mixer (FAM) and incorporating distances to principal time steps with the Event Proximity-Aware Mixer (EPAM). These proposed methods enable time-dependent and adaptive feature mixing, demonstrating the utility of permutation-dependent feature mixing for dynamically capturing the relationships between features. Experimental results across various real-world datasets indicate the potential of the proposed methods for permutation-dependent feature mixing in adaptively modeling relationships between features and reveal the importance of feature mixing in multivariate time series forecasting.