1 Introduction

Time series prediction plays a vital role in numerous domains, including meteorology [1,2,3], finance [4, 5], engineering [6, 7], and industry [8,9,10]. In these fields, it not only uses large amounts of historical data to analyze the past state space of a system, but also provides crucial and effective predictions of future series. However, practical prediction tasks are often accompanied by the challenges of long time series and multivariate collaboration [11, 12]. For example, in meteorological prediction tasks, pollutant concentration accumulates spatially over long periods of time. At the same time, pollutant concentration is often affected by many factors, such as season, weather, and factory emissions [13].

A large number of data-driven models have been proposed to mine the regularities and patterns of the series. Autoregressive series models have been utilized for time series prediction, including autoregression (AR) [14] and ARIMA [15]. However, this kind of model [14, 15] cannot be applied to predict future trends with multiple variables. With the rapid development of deep learning, an increasing number of neural network models have been proposed to predict time series. Recurrent neural networks (RNNs) have been used to perform time series prediction [16, 17]. However, RNNs encounter significant challenges with vanishing and exploding gradients. Besides, since the model does not contain long-term memory, it is difficult to achieve significant results on long time series. Some models [18,19,20,21] have been developed to address the issue of unstable gradients. A gated unit has been introduced in [18,19,20,21] to constrain the loss of long-term memory, which makes the gradient more stable. In addition, naïve LSTM [19] added a forget gate to reinforce the importance of short-term memory. However, the long-term memory loss problem still exists in these models [18,19,20,21]. As the sequence grows, their prediction performance gradually deteriorates [21]. An encoder–decoder architecture has been designed to predict time series in [22]. It alleviates long-term dependency loss and enables prediction of arbitrary step sizes through an autoregressive architecture. MS-LSTM [23] adds an attention mechanism to ensure the effectiveness of the prediction results. Att-LSTM [24] superimposes the attention mechanism to obtain global correlation, improving the performance of the model in long-term prediction. However, the distance over which dependencies can be extracted cannot be extended indefinitely in [22,23,24] due to their causal architectures.

With the development of deep learning, more and more transformer-based [25] methods have been proposed for time series prediction, such as Transformer [26] and ConvTrans [27]. However, these transformer-based methods [26, 27] only take atomic time-scale data as input, without fusing features at different granularities, and therefore cannot make accurate predictions in the absence of rich contextual information. At the same time, the multidimensional input variables are embedded into the feature representation by a simple linear transformation [26, 27], which makes it impossible for them to mine the correlations between variables.

A novel time series prediction method has been developed in the multiscale convolutional neural-based transformer network (MCTNet). It is composed of multiscale extraction (ME) and multidimensional fusion (MF) frameworks. The ME framework contains both a multiscale convolutional feature extractor and a temporal attention-based representator, and it can be used to mine dependencies at different timescales. In order to fully exploit the correlation among variables, the MF framework has been designed to capture the relationship among inputs with a spatial attention-based highway mechanism. The linear elements of the input sequence are effectively preserved in MF. Experimental results demonstrate that MCTNet has distinguished performance for time series prediction in comparison with some state-of-the-art methods. The main contributions are summarized as follows: Firstly, an ME framework has been proposed to mine context dependency over different time lengths. It contains both a multiscale convolutional feature extractor and a temporal attention-based representator, and it can be applied to extract context dependence at different timescales. Secondly, an MF framework has been developed to capture the relationship among inputs by utilizing a spatial attention-based highway mechanism. It can be used to learn patterns among different variables, and the linear elements of the input sequence are effectively preserved. Finally, and importantly, MCTNet can learn sequence representations at different timescales and adaptively fuse these representations according to their contribution levels. Experiments highlight that MCTNet has outstanding performance for time series prediction based on comparisons with some state-of-the-art methods on challenging datasets.

The rest of this paper is organized as follows: MCTNet-based time series prediction is described in Sect. 2. Experimental analyses and discussions are introduced in Sect. 3, followed by conclusions in Sect. 4.

2 MCTNet-based time series prediction

It is well known that deep learning models contain a large number of nonlinear activation functions, which enables them to deal with sophisticated tasks. However, the heavy superposition of nonlinear transformations means that such models often struggle to capture linear dependencies in time series. MCTNet has been designed as a parallel architecture with both nonlinear and linear extractors, so that neither kind of dependency is discarded by the model, as shown in Fig. 1. The figure provides a brief MCTNet architecture, including both ME and MF in parallel. MCTNet profits from the highly parallelized computation of the transformer encoder [25] with much lower run-time overhead, unlike the autoregressive generation method [21] with its high time complexity.

Fig. 1
figure 1

A workflow of MCTNet for time series prediction

Since ME adopts the transformer encoder [25] to capture information from different time steps globally via the self-attention mechanism, the problem of ambiguous long-term dependencies can be addressed. MF has been used to preserve linear dependencies in the time series data, in contrast to the ME block, which captures nonlinear dependencies. A highway network in MF adds linear elements from the input time series data to the final prediction. A spatial attention mechanism has been added to enhance the correlation among different variables. MCTNet can efficiently perform time series prediction on different timescales by combining ME and MF.

The multi-step prediction result of MCTNet can be defined as:

$$ {\mathbf{Y}} = {\mathbf{L}}_{o} + {\mathbf{G}}_{o} + {\mathbf{H}}_{o} $$
(1)

where Y ∈ RN×F is the output of MCTNet, N stands for the length of the predicted output sequence, and F stands for the dimension of the predicted output sequence. The sum of Lo ∈ RN×F and Go ∈ RN×F represents the output of the ME block, and Ho is the output of the MF block.
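To illustrate how Eq. (1) can be assembled in code, a minimal PyTorch sketch is given below. The module and argument names (MCTNetHead, me_branch, mf_branch) are hypothetical placeholders for the ME and MF components detailed in Sects. 2.1 and 2.2; this sketch only illustrates the summation in Eq. (1) and is not the authors' implementation.

```python
import torch
import torch.nn as nn

class MCTNetHead(nn.Module):
    """Combines the branch outputs according to Eq. (1): Y = L_o + G_o + H_o."""

    def __init__(self, me_branch: nn.Module, mf_branch: nn.Module):
        super().__init__()
        self.me_branch = me_branch  # assumed to return (L_o, G_o), each of shape (B, N, F)
        self.mf_branch = mf_branch  # assumed to return H_o of shape (B, N, F)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        l_o, g_o = self.me_branch(x)   # nonlinear multiscale outputs of the ME block
        h_o = self.mf_branch(x)        # linear highway output of the MF block
        return l_o + g_o + h_o         # Y in Eq. (1)
```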

2.1 Multiscale extraction mechanism

A novel feature extraction structure has been designed to mine the nonlinearity in multiscale time series data. It combines a transformer encoder [25] with a multiscale feature fusion block. First, ME has been used to extract different ranges of dependencies from the input sequence. The transformer encoder [25] has been selected to further process the high-dimensional features.

2.1.1 Transformer

The canonical transformer model utilizes an encoder–decoder framework as a seq2seq model [28]. The encoder is made up of encoding layers that process the input recursively, while the decoder is made up of decoding layers that process the encoder's output recursively. Both the encoder and decoder process input tensors layer by layer.

Inspired by BERT [29], a transformer-based natural language processing model that only uses the encoder to learn representations from large natural language datasets, we chose the transformer's encoder as the backbone of MCTNet for time series dependency extraction. As a result, the computational cost is reduced, and the results of multiple time-step predictions can be obtained with only one inference.

The self-attention mechanism serves as the encoder's central component. Each self-attention block is composed of multiple scaled dot-product attention mechanisms, as illustrated in Fig. 2a [25].

Fig. 2
figure 2

a Scaled dot-product attention, b Multi-head attention

However, when encoding the information at the current position with a single scaled dot-product attention, the model tends to focus excessively on its own position and ignore information from other positions. Therefore, the model with multi-head attention jointly attends to information from several representation subspaces at different positions, as shown in Fig. 2b. Meanwhile, the multi-head self-attention mechanism supports parallel computing.

A multi-head attention process is defined as follows:

$$ {\text{MultiHead}}({\mathbf{Q}}, \, {\mathbf{K}}, \, {\mathbf{V}}) = {\text{Concat}}({\text{head}}_{1} , \ldots ,{\text{head}}_{h} ){\mathbf{W}}^{o} $$
(2)

where Q, K, V denote the query, key, and value matrices, respectively. Each of these matrices represents a high-dimensional input embedding. Wo ∈ Rd×d is the self-attention output matrix used as a linear transformation after concatenating the individual heads, and d represents the embedding dimension of the input sequence.

The attention score of each head is defined as:

$$ {\text{head}}_{i} = {\text{Attention}}\left({\mathbf{QW}}_{i}^{q} , \, {\mathbf{KW}}_{i}^{k} , \, {\mathbf{VW}}_{i}^{v} \right) $$
(3)
$$ {\text{Attention}}({\mathbf{Q}}, \, {\mathbf{K}}, \, {\mathbf{V}}) = {\text{softmax}}\left(\frac{{{\mathbf{QK}}^{{\text{T}}} }}{\sqrt d }\right){\mathbf{V}} $$
(4)

where Wiq ∈ Rd×d, Wik ∈ Rd×d, Wiv ∈ Rd×d are learnable parameter matrices that map Q, K, V into the high-dimensional representation space, respectively, and the square root of d is used as a linear scaling factor to ensure stable propagation of the gradient.

The softmax nonlinear function is defined as:

$$ {\text{softmax}}(x_{m} ) = \frac{{e^{{x_{m} }} }}{{\sum\nolimits_{n} {e^{{x_{n} }} } }}, $$
(5)
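As a compact illustration of Eqs. (2)–(5), the PyTorch sketch below implements multi-head scaled dot-product self-attention. It is a minimal sketch under our own assumptions (d divisible by the number of heads h, and each head scaled by the square root of its head dimension, a common variant of the sqrt(d) factor in Eq. (4)); in practice, torch.nn.MultiheadAttention offers an equivalent optimized implementation.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Scaled dot-product attention with h heads, following Eqs. (2)-(5)."""

    def __init__(self, d: int, h: int):
        super().__init__()
        assert d % h == 0, "embedding dimension d must be divisible by the head count h"
        self.h, self.d_head = h, d // h
        self.w_q = nn.Linear(d, d)   # stacks W_i^q of all heads
        self.w_k = nn.Linear(d, d)   # stacks W_i^k of all heads
        self.w_v = nn.Linear(d, d)   # stacks W_i^v of all heads
        self.w_o = nn.Linear(d, d)   # W^o in Eq. (2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        # project and split into heads: (B, h, T, d_head)
        q = self.w_q(x).view(b, t, self.h, self.d_head).transpose(1, 2)
        k = self.w_k(x).view(b, t, self.h, self.d_head).transpose(1, 2)
        v = self.w_v(x).view(b, t, self.h, self.d_head).transpose(1, 2)
        # Eq. (4): softmax(QK^T / sqrt(d)) V, with per-head scaling assumed here
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        out = torch.softmax(scores, dim=-1) @ v
        # Eq. (2): concatenate the heads and apply W^o
        out = out.transpose(1, 2).reshape(b, t, d)
        return self.w_o(out)
```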

Due to the over-the-horizon advantage of the transformer [25], correlation links can be established for all time steps without distance restrictions. However, this also prevents the model from learning causal temporal trends. Therefore, a positional embedding is added so that the model can capture temporal information.
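The paper only states that a positional embedding is added; the sketch below uses the standard sinusoidal encoding of [25] as one plausible choice and assumes an even embedding dimension d.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding (one common choice); returns shape (seq_len, d)."""
    position = torch.arange(seq_len).unsqueeze(1).float()
    div_term = torch.exp(torch.arange(0, d, 2).float() * (-math.log(10000.0) / d))
    pe = torch.zeros(seq_len, d)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions (assumes d is even)
    return pe  # added to the input embeddings before the encoder
```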

2.1.2 Local and global dependency extraction

Time series context dependency at different scales is typically presented in distinct time series datasets. A novel multiscale sequence dependency extraction strategy has been proposed to improve the generalization performance of the model with different dependency scales. The structure of ME is shown in Fig. 3a.

Fig. 3
figure 3

a The structure of ME, b Local and global feature attention fusion

The ME module consists of a local dependency extraction convolution kernel and a global dependency preadaptive extraction convolution kernel. The local convolution kernel and the global preadaptive convolution kernel are used to capture short-term and long-term dependency, respectively. The sizes of the two convolutional filters are not identical, and their parameter weights are not shared. The results of the two kernels are fused by the attention fusion shown in Fig. 3b.

The local feature extractor uses a convolutional kernel of moderate size and short stride, which is sufficient to collect comprehensive time series data and mine dense context dependencies. In contrast, the global feature preadaptive extractor has a wide receptive field and takes advantage of a sparse convolutional kernel to acquire latent long-term dependencies. The filter size and sliding stride of the global feature preadaptive extractor are more than twice those of the local feature extractor. The input series are encoded as high-dimensional global features after being distilled by the global feature preadaptive extractor. Then, maxpooling is used to keep the local and global feature lengths consistent. In order to enhance strong correlations and attenuate weak correlations within different periods, the global feature preadaptive extractor utilizes a temporal attention mechanism to constrain feature expression. After the local extraction outputs are superimposed on the global extraction outputs, the ME block finally obtains dependencies at different timescales.

The output of local convolution kernel is defined as follows:

$$ {\mathbf{L}}_{c} = {\mathbf{W}}_{l} \cdot {\mathbf{X}} + {\mathbf{b}}_{l} , $$
(6)

where Lc ∈ Rd×f_len is the output of the local convolution kernel, X ∈ Rd×f_len is the input, f_len is the temporal length of the input sequence, Wl is the parameter matrix of the local convolutional kernel, and bl is a bias vector.

The output of local feature extractor process is described as follows:

$$ {\mathbf{L}}_{o} = {\text{maxpool}}({\mathbf{L}}_{c} ), $$
(7)

where Lo ∈ Rd×(f_len/2) is the local output, and the maxpool operation halves Lc in the temporal dimension.

The global feature preadaptive extractor process is described as follows:

$$ {\mathbf{G}}_{c} = {\mathbf{W}}_{g} \cdot {\mathbf{X}} + {\mathbf{b}}_{g} , $$
(8)
$$ {\mathbf{G^{\prime}}}_{o} = {\text{maxpool}}({\mathbf{G}}_{{\text{c}}} ), $$
(9)

where Gc ∈ Rd×(f_len/4) is the output of the global preadaptive convolution kernel, G′o is the transitory global output after pooling, Wg is the parameter matrix of the global convolutional filter, and bg is a bias vector.

The temporal attention mechanism in Fig. 3b is described as follows:

$$ {{\varvec{\upalpha}}} = {\text{tanh}}({\text{upsample}}({\mathbf{G^{\prime}}}_{o} )), $$
(10)
$$ {\mathbf{S}}_{t} = {\text{softmax}}({{\varvec{\upalpha}}} \cdot {\mathbf{W}}_{a} + {\mathbf{b}}_{a} ), $$
(11)
$$ {\mathbf{G}}_{o} = {\mathbf{G^{\prime}}}_{o} \odot {\mathbf{S}}_{t} , $$
(12)

where G′o is upsampled so that its temporal dimension matches that of the local output before the two are fused, α is the transitory output, St stands for the attention scores at different time steps, Wa is a learnable parameter matrix, ba is a bias vector, and Go is the preadaptive output, obtained as the Hadamard product of G′o and St.
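The fragment below sketches the ME branch of Eqs. (6)–(12) in PyTorch. The kernel sizes, strides, and the exact placement of the upsampling step are our own assumptions (the paper only states that the global filter size and stride are more than twice those of the local one), so it should be read as an illustration rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleExtraction(nn.Module):
    """Local/global convolutional extraction with temporal attention, Eqs. (6)-(12)."""

    def __init__(self, d: int, local_kernel: int = 3, global_kernel: int = 7):
        super().__init__()
        # Eq. (6): local kernel with moderate size and short stride (assumed values)
        self.local_conv = nn.Conv1d(d, d, local_kernel, stride=1, padding=local_kernel // 2)
        # Eq. (8): global preadaptive kernel, wider and sparser (assumed values)
        self.global_conv = nn.Conv1d(d, d, global_kernel, stride=2, padding=global_kernel // 2)
        # Eq. (11): learnable temporal attention parameters W_a, b_a
        self.attn = nn.Linear(d, d)

    def forward(self, x: torch.Tensor):
        # x: (B, d, f_len)
        l_o = F.max_pool1d(self.local_conv(x), 2)        # Eqs. (6)-(7), local output
        g_prime = F.max_pool1d(self.global_conv(x), 2)   # Eqs. (8)-(9), transitory global output
        # upsample the global path so both paths share the same temporal length
        g_prime = F.interpolate(g_prime, size=l_o.shape[-1])
        alpha = torch.tanh(g_prime)                      # Eq. (10)
        s_t = torch.softmax(                             # Eq. (11), scores over time steps
            self.attn(alpha.transpose(1, 2)).transpose(1, 2), dim=-1
        )
        g_o = g_prime * s_t                              # Eq. (12), Hadamard product
        return l_o, g_o                                  # fed to the transformer encoder
```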

The results obtained by the above process are inputted into the transformer encoder [25] layer to further mine the nonlinear correlation in the time series for the final prediction.

2.2 Multidimensional feature fusion mechanism

While ME has advantages in extracting nonlinear dependencies in sequences, it ignores some less prominent linear information. The MF mechanism has been proposed inspired by [30, 31]; it combines the highway mechanism [31] with spatial attention to enhance the representation of linear features. Its calculation is the same as the temporal attention mechanism of the ME component; however, while the former adaptively weights different time steps in the temporal dimension, the latter adaptively weights the importance of the input multivariate covariates in the feature dimension. The weighted high-dimensional vectors are compressed into a single dimension by summation. To preserve the linear elements of the representations as much as possible, only linear layers without nonlinear activation functions are used to transform the representations. The structure of MF is shown in Fig. 4.

Fig. 4
figure 4

The structure of MF

Taking Fig. 4 as an example, the feature dimension of the input sequence is 4, meaning that there are four separate feature sequences as input. Define the input vector at the ith time step as Fi = [f0i, f1i, f2i, f3i]. The spatial attention process is described as follows:

$$ {\mathbf{s}} = {\text{tanh}}({\mathbf{F}}_{i} ), $$
(13)
$$ {\mathbf{A}}^{i} = {\text{softmax}}({\mathbf{s}} \cdot {\mathbf{W}}_{f} + {\mathbf{b}}_{f} ), $$
(14)
$$ h^{i} = \, {\mathbf{F}}^{i} \cdot {\mathbf{A}}^{i} , $$
(15)

where s is the transitory output, Ai ∈ R4×1 stands for the attention scores of Fi, Wf is a learnable parameter matrix, bf is an attention bias vector, and hi is the output representation at the ith time step.
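A minimal sketch of Eqs. (13)–(15) is given below, assuming num_features covariates (4 in the example of Fig. 4) and that the compressed representation is mapped to the prediction horizon by a purely linear, activation-free projection, as described above; the tensor shapes and the final projection layer are our assumptions.

```python
import torch
import torch.nn as nn

class MultidimensionalFusion(nn.Module):
    """Spatial attention-based highway fusion over the covariate dimension, Eqs. (13)-(15)."""

    def __init__(self, num_features: int, in_len: int, pred_len: int):
        super().__init__()
        self.attn = nn.Linear(num_features, num_features)  # W_f, b_f in Eq. (14)
        # purely linear projection to the prediction horizon, preserving linear elements
        self.proj = nn.Linear(in_len, pred_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, num_features) -- one vector F_i per time step
        s = torch.tanh(x)                        # Eq. (13)
        a = torch.softmax(self.attn(s), dim=-1)  # Eq. (14), scores over covariates
        h = (x * a).sum(dim=-1)                  # Eq. (15), weighted compression to (B, T)
        return self.proj(h)                      # H_o, mapped to the output length
```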

3 Experimental analyses and discussions

In order to verify the effectiveness of MCTNet, experiments have been done on some challenging time series datasets with different dependency scales and covariate dimensions. The experiments are conducted on a server with the Ubuntu 20.04 LTS operating system, an Intel(R) i7-7700 CPU, an Nvidia(R) GeForce RTX 4070 Ti GPU, and PyTorch 1.11.0.

3.1 Datasets

Electricity Transformer Temperature [12] (ETT) contains 1 h-level and 15 min-level electrical transformer data. The ETT [12] dataset contains seven attributes, with six power load features and the oil temperature as the prediction target. The variation trend of electrical load often shows a certain recent dependency: it has a strong correlation with recent historical data. Electrical data therefore tests the ability of the model to extract short-term dependencies.

Beijing Multi-Site Air-Quality Dataset [32] (AQI) contains hourly air pollution data with 18 attributes and 420,768 time steps. It reflects seasonal trends to some extent. The variation tendency of air quality has an obvious periodicity, and the change of season affects the weather and air quality to a large extent. At the same time, temperature and humidity show significant periodicity in the short term on the scale of days. Besides, the proportion of missing values in the AQI dataset [32] is as high as 25.67%. Such data, with strong long- and short-term correlations, is a test for the model.

Grottoes Physics Properties (Grottoes) is another dataset that is significantly different from the above ones [12, 32]. It is collected from the Yungang Grottoes in China, a UNESCO World Heritage site. The Grottoes dataset contains six attributes, including mass, rock stress, chromatic aberration, ultrasonic conduction velocity, hardness, and magnetic susceptibility, with 14,400 time steps. The changes in the physical properties of the Grottoes accumulate over a long period of time, which is observably different from the previous two datasets [12, 32]. The predicted values strongly correlate with extremely long-term historical data, which requires the model to capture extremely long-time dependencies.

3.2 Metrics and parameter discussion

In order to get a fair comparison with some state-of-the-art methods, the first 128 time steps of each sequence in the above-mentioned datasets are used to forecast the next 16 time steps. Mean square error (MSE), mean absolute error (MAE), and Pearson's correlation coefficient (CORR) are selected to evaluate the models. A stronger prediction performance is indicated by a higher CORR as well as lower MSE and MAE values.

$$ {\text{MSE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {(y_{i} - \widehat{{y_{i} }})^{2} } , $$
(16)
$$ {\text{MAE}} = \frac{1}{n}\sum\limits_{i = 1}^{n} {\left| {y_{i} - \widehat{{y_{i} }}} \right|} , $$
(17)
$$ {\text{CORR}} = \frac{{\sum\nolimits_{i = 1}^{n} {(y_{i} - \overline{{\mathbf{Y}}} )} (\widehat{{y_{i} }} - \overline{{{\hat{\mathbf{Y}}}}} )}}{{\sqrt {\sum\nolimits_{i = 1}^{n} {(y_{i} - \overline{{\mathbf{Y}}} )^{2} } } \sqrt {\sum\nolimits_{i = 1}^{n} {(\widehat{{y_{i} }} - \overline{{{\hat{\mathbf{Y}}}}} )^{2} } } }}, $$
(18)

where yi is the output value of the model, ŷi is the target value at time step i, and n is the number of samples in the sequence. In (18), \(\overline{{\varvec{Y}}}\) and \(\overline{\widehat{{\varvec{Y}}}}\) stand for the mean of the predicted sequence and the mean of the target sequence, respectively.
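The metrics of Eqs. (16)–(18) can be transcribed directly into NumPy as follows; this is a straightforward transcription for reference, not the authors' evaluation script.

```python
import numpy as np

def mse(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    return float(np.mean((y_pred - y_true) ** 2))          # Eq. (16)

def mae(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    return float(np.mean(np.abs(y_pred - y_true)))          # Eq. (17)

def corr(y_pred: np.ndarray, y_true: np.ndarray) -> float:
    dp = y_pred - y_pred.mean()                              # deviations of the prediction
    dt = y_true - y_true.mean()                              # deviations of the target
    return float((dp * dt).sum()
                 / (np.sqrt((dp ** 2).sum()) * np.sqrt((dt ** 2).sum())))  # Eq. (18)
```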

The proportional segmentation method is applied to generate the training and test datasets while preserving their causality. The first 80% of the time series is used as the training dataset, and the last 20% is used as the test dataset to evaluate the performance of different models. The learning rate and batch size are set to 0.001 and 512, respectively. The model has undergone 5000 training iterations.
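A minimal sketch of the proportional 80/20 causal split and the training configuration stated above:

```python
import numpy as np

def causal_split(series: np.ndarray, train_ratio: float = 0.8):
    """Proportional segmentation: the first 80% is the training set, the last 20% the test set."""
    split = int(len(series) * train_ratio)
    return series[:split], series[split:]

# training configuration reported above
LEARNING_RATE = 0.001
BATCH_SIZE = 512
TRAINING_ITERATIONS = 5000
```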

3.3 Prediction performance analyses in different feature dimensions

Although a low embedding dimension gives superior generalization performance on the test dataset, it leads to poor performance when learning nonlinear patterns from complex massive data. Conversely, both generalization and learning performance exhibit the opposite trend as the dimension increases. In order to find a reasonable embedding dimension d for the transformer encoder layer, d is varied from 64 to 384 at intervals of 64. Some experimental results for different values of d are given in Figs. 5, 6, 7, respectively.

Fig. 5
figure 5

Prediction performances in different ds at the ETT dataset [12]

Fig. 6
figure 6

Prediction performances in different ds at the AQI dataset [32]

Fig. 7
figure 7

Prediction performances in different ds at the Grottoes dataset

One can find from Figs. 5, 6, 7 that the prediction performance is best on the chosen datasets [12, 32] and the Grottoes dataset when the embedding dimension is 128 as a whole. The model does not achieve optimal performance when d is 64. When d increases to 128, the prediction performance on each dataset tends to be optimal. As d increases further, the model overfits on each dataset, and the prediction performance deteriorates sharply. It can be seen that the performance level of the same indicator on different datasets is not completely consistent, even when the embedding dimension is kept constant. The reason is that the data distribution and feature correlation of different datasets are inconsistent, which leads to uncertainty in performance. The embedding dimension d is therefore set to 128 and kept for the subsequent experiments.
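The embedding-dimension sweep described above can be written as a simple loop; the train_and_evaluate callable is a hypothetical stand-in for the full training and evaluation pipeline and must be supplied by the caller.

```python
from typing import Callable, Dict, Tuple

Metrics = Tuple[float, float, float]  # (MSE, MAE, CORR)

def sweep_embedding_dim(train_and_evaluate: Callable[[int], Metrics]) -> Dict[int, Metrics]:
    """Sweep d from 64 to 384 in steps of 64 and collect the metrics for each value."""
    return {d: train_and_evaluate(d) for d in range(64, 385, 64)}
```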

3.4 Transformer self-attention analyses

To investigate how different locations in the sequence contribute to the final prediction, heatmaps of the self-attention scores at different time steps are shown in Fig. 8.

Fig. 8
figure 8

Self-attention score heatmaps at disparate datasets

The horizontal and vertical axes in Fig. 8 represent keys and queries at different time steps, respectively. The shade of color represents the attention score, that is, the contribution of different time steps to the final prediction. Loci with lighter colors represent higher contributions, and vice versa. The figure shows the self-attention scores on the ETT [12], AQI [32], and Grottoes datasets from left to right, respectively.
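A heatmap in the style of Fig. 8 can be drawn from a recorded attention matrix with matplotlib; the attention scores below are random placeholder data for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt

# placeholder attention matrix (query time steps x key time steps) for illustration
attn = np.random.rand(128, 128)
attn = attn / attn.sum(axis=-1, keepdims=True)   # row-normalize, mimicking softmax output

plt.imshow(attn, cmap="viridis")                 # lighter cells indicate larger contributions
plt.xlabel("key time step")
plt.ylabel("query time step")
plt.colorbar(label="attention score")
plt.show()
```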

The highlights are mainly distributed in the interval 112–128 for the ETT [12] dataset, which indicates that the prediction of its trend depends on recent data. However, the model is not interested in the short-term data for the AQI [32] dataset. On the contrary, the model prefers the relatively long-term historical data, and its brightness distribution shows a certain periodic superposition. The contribution degree of different time steps increases with time for the Grottoes dataset with long-time dependencies, which indicates that the weight of each time step needs to be balanced in the model.

Among them, the prediction of MCTNet for the ETT [12] dataset is strongly correlated with the recent historical data, while the contribution of the distant historical data is weak. The MF compensates for the historical data over a period of time in the prediction output. This enables the model to learn the trend of data change in the recent period. The model is more dependent on distant historical data for the AQI [32] dataset, and the importance degree also reflects a certain periodicity in Fig. 8. The multiscale convolution filters and attention fusion in ME enable MCTNet to mine the correlation of its different periods, which improves the prediction's effectiveness. For the Grottoes dataset with a slow trend, the scores of different time steps also exhibit a smooth change. The self-attention mechanism in MCTNet captures long-term dependencies by adaptively weighting different time steps.

3.5 Ablation study

To further test the performance of MCTNet, ablation experiments have been done on the ETT [12] and AQI [32] datasets, respectively. Some experimental results are shown in Table 1. The optimal results in Table 1 are highlighted in boldface.

Table 1 Ablation experimental results with different models at different datasets

One can find from Table 1 that our proposed models and mechanisms, including ME-Conv, ME-Att, MF, and MCTNet, help to improve the prediction performance, and MCTNet has the best prediction performance. Some reasons are as follows: The ME preadaptive feature extractor is mainly composed of a multiscale convolution filter and an attention fusion block. ME degrades to a naive linear encoder after removing the convolution filters, namely ME-Conv. ME degenerates to a simple local information extractor after removing the multiscale convolution filter and temporal attention-based fusion, namely ME-Att. MF is composed of a spatial attention-based highway network. The introduction of MF settles the problem of missing linear elements and strengthens the extraction of short-term information. Both ME and MF leverage multiscale and multidimensional information to make more precise predictions. To verify that the maxpool operation in the attention fusion block has optimal performance, maxpool has been replaced by avgpool and power-average pool in the ablation study, corresponding to '(w/) AvgPool' and '(w/) PAPool' in Table 1, respectively. Experimental results show that MCTNet performs worse with avgpool than with maxpool. As a result, every portion of the proposed model contributes in a specific way to the overall performance.

3.6 Comparisons with some state-of-the-art methods

To further evaluate the prediction performance of MCTNet, some state-of-the-art methods have been chosen for a fair comparison across different datasets, including LSTM [33], Att-BLSTM [34], Transformer [35], and ConvTrans [27]. To get a fair comparison, all the corresponding parameters used are the authors' recommended ones for each method. The experiment has been done with six different prediction lengths and an input time series length of 128 on the ETT [12], AQI [32], and Grottoes datasets. Some prediction performances at different lengths are shown in Figs. 9, 10, and 11, respectively.

Fig. 9
figure 9

Prediction performance in different lengths at the ETT dataset [12]

Fig. 10
figure 10

Prediction performance in different lengths at the AQI dataset [32]

Fig. 11
figure 11

Prediction performance in different lengths at the Grottoes dataset

It can be found from Figs. 9, 10, 11 that MCTNet has the best performance among the investigated methods [27, 33,34,35]. MCTNet has the lowest MSE and MAE and the highest CORR at different lengths as a whole. To further demonstrate that MCTNet has the best prediction performance, we calculate the average MSE, MAE, and CORR for the six different prediction lengths from Figs. 9, 10, 11. Some comparative results are given in Table 2. The optimal results in Table 2 are highlighted in boldface. One can find from Table 2 that MCTNet has the best prediction performance among the investigated approaches [27, 33,34,35]. Some reasons are as follows: Sequential dependency extraction is achieved by a causal approach in LSTM [33] and Att-BLSTM [34]. The pattern learned at the current time step must depend on the previous historical time steps, which makes the prediction error accumulate steadily. As the length of the sequence increases, the results predicted by LSTM [33] and Att-BLSTM [34] deviate more and more from the actual target, as shown in Figs. 9, 10, 11, respectively. Transformer [35] and ConvTrans [27] can ignore the distance information in space and overcome the problem of error accumulation, and they have superior prediction performance for different time-dependent tasks. However, Transformer [35] and ConvTrans [27] cannot mine efficient representations from sequences with different timescales and dimensions. MCTNet takes advantage of information at different timescales, which allows it to predict outcomes efficiently by accounting for covariate relationships.

Table 2 Some comparative results in different methods

To verify the robustness of MCTNet, we conducted a variety of experiments with different input and output lengths. The results are shown in Table 3. Among them, the length indices '16 → 8,' '32 → 16,' and '64 → 32' represent input lengths of 16, 32, 64 and output lengths of 8, 16, 32, respectively. The results show that MCTNet maintains strong prediction performance even when the input and output lengths of the time series data change, which demonstrates the robustness of the model.

Table 3 Some comparative results with different methods in multiple sequence lengths

Figure 12 shows the prediction curves of multiple methods [27, 34, 35] on the ETT dataset. The average deviation of MCTNet over the whole prediction time window is smaller than that of the other models. Profiting from the over-the-horizon ability of the self-attention architecture, MCTNet can still maintain stable prediction results as the prediction time step grows. The experimental results from Tables 2 and 3 further show that MCTNet outperforms some state-of-the-art methods in comparisons on different challenging datasets.

Fig. 12
figure 12

Prediction curves of multiple methods

4 Conclusions

A novel method, MCTNet, has been developed for time series prediction. It is composed of ME and MF frameworks. The ME framework contains both a multiscale convolutional feature extractor and a temporal attention-based representator, and it can be used to mine dependencies at different timescales. The MF component simultaneously mines the correlation between different covariates to improve the model's capacity for linear expression. The relationship among inputs is captured by a spatial attention-based highway mechanism in MF. Experimental results demonstrate that MCTNet has outstanding performance for time series prediction in comparison with some state-of-the-art methods on challenging datasets.

In the future, we will expand the covariate prediction dimension so that the model can handle time series data with a large number of input dimensions. At the same time, we will optimize the computational structure to ensure low computational overhead even when the data volume grows sharply.