Introduction

The future direction of rapidly developing aquaculture focuses on intensive and intelligent solutions. In complex, dynamic, and nonlinear aquaculture systems, water quality is affected by many factors. Human management alone cannot respond in time to rapid changes in water quality, which restricts the sustainable development of aquaculture. As one of the key indicators of water quality, the concentration of dissolved oxygen directly affects the healthy growth of cultured species. Keeping the dissolved oxygen content in a proper range improves feed efficiency, reduces the stress response of aquaculture species, and ensures the economic benefits of rearing. Therefore, it is of great practical significance to establish a prediction model of dissolved oxygen based on historical data and to predict the trend of dissolved oxygen content accurately.

Aquaculture involves many water quality variables, and the water quality information exists in the form of multivariate time series datasets. Water quality changes periodically in time. Many scholars worldwide have carried out exploration and in-depth research to solve the problem of water quality prediction. The common methods for water quality prediction include artificial neural networks (ANN), regression analyses (RA), grey model (GM), and support vector regressions (SVR) (Zhou et al. 2018), as well as other methods (Rahman et al. 2020). Zhu et al. (2017) used the model of least squares support vector regression (LSSVR) with fruit fly optimization algorithm (FOA) to predict dissolved oxygen. Girija and Mahanta (2010) applied and compared artificial neural network and Mamdani’s fuzzy logic control for the prediction of dissolved oxygen in an effluent-impacted urban river. The results showed that the neural network had better prediction performance. Xiao et al. (2017) used the traditional method to predict the dissolved oxygen in the pond and compared the algorithms according to the prediction accuracy. The results showed that the neural network was better than auto-regression (AR), grey model (GM), support vector machines (SVM), and curve fitting (CF). Ji et al. (2017) used support vector machine (SVM) to predict the dissolved oxygen concentration in a hypoxic river in South-Eastern China, and this method provided the optimal performance. The traditional models have some shortcomings, such as local optimal solutions, poor stability, and weak generalization ability. The self-learning characteristics of neural networks and their ability to process nonlinear information make up for the shortcomings of traditional dissolved oxygen time series models. Accordingly, they are widely used for dissolved oxygen prediction.

In recent years, deep learning methods have made significant progress in improving accuracy in time series prediction (Qin 2019; Dabrowsk et al. 2020). For the prediction of dissolved oxygen in aquaculture, the long short-term memory (LSTM) network performs well among deep learning models (Liu et al. 2019; Dabrowski et al. 2018). LSTM can memorize previous data; thus, it has some advantages in dealing with time series prediction. It is widely used in machine translation (Cui et al. 2015; Wang et al. 2016), natural language processing (Yao and Huang 2016; Yao and Guan 2018), time series prediction (Shi et al. 2015; Li et al. 2019), and other fields. Chen et al. (2018) proposed a prediction model of dissolved oxygen in aquaculture based on principal component analysis (PCA) and long short-term memory (LSTM) neural network, which has better prediction performance and generalization ability than traditional prediction methods. Hu et al. (2019a, b) proposed a water quality prediction method based on the deep LSTM learning network for the maricultural environment. Yang et al. (2018) established an end-to-end and trainable LSTM neural network model, which combined temporal and spatial information to realize the prediction of sea surface temperature. Improved variants of LSTM and its combinations with other algorithms are increasingly used in water quality prediction, and the experimental results show that a combined model has more reliable performance and higher prediction accuracy than a single one. Accordingly, Li et al. (2018) proposed a hybrid model based on sparse auto-encoder (SAE) and long short-term memory (LSTM) network. Barzegar et al. (2020) established a hybrid model of long short-term memory (LSTM) and convolutional neural network (CNN). Yuan et al. (2018) investigated the accuracy of a hybrid long short-term memory neural network and ant lion optimizer model (LSTM-ALO) in the prediction of monthly runoff.

Recently, the temporal convolutional network has been proposed to further advance time series prediction. The results show that convolutional architectures can outperform recurrent networks for various datasets and tasks, and the temporal convolutional network has better performance in the sequence modeling task (Bai et al. 2018). Ta and Wei (2018) proposed a simplified reverse understanding convolutional neural network (CNN) to predict dissolved oxygen, which had a faster convergence rate and better stability than the BP network. Zhao et al. (2019) proposed a new method combining graph convolutional network (GCN) and gated recurrent unit (GRU), which can capture the spatial and temporal dependences to achieve accurate and real-time traffic prediction. Deng et al. (2019) proposed a novel knowledge-driven temporal convolutional network (KDTCN) to predict the future trend of stocks.

Related experiments show that RNNs and CNNs can simulate complex nonlinear feature interactions and have good performance for time series processing (Chen et al. 2019). How to utilize the historical data of water quality time series effectively is still a challenge.

In this paper, a dissolved oxygen prediction model based on the combination of a long short-term memory network and a temporal convolutional network is proposed. The long short-term memory network can deal with the long-term dependence of complex time series. The temporal convolutional network addresses the lack of parallelism in the LSTM network and improves the flexibility of the model structure. The dilated and causal convolutions in the network make the model suitable for time series data that require a large receptive field. After preprocessing, the water quality data pass through the LSTM and TCN networks in sequence, and the model finally outputs the prediction results. In parallel, comparative experiments with the LSTM, CNN-LSTM, and TCN algorithms are carried out. The influence of the attention mechanism on prediction and the change of prediction accuracy caused by the size of the historical time window are discussed. The experimental results demonstrate that the combined LSTM-TCN has better dissolved oxygen prediction performance than the other algorithms.

The rest of the paper is organized as follows: The basic principles of LSTM and TCN and the overall framework of our proposed combined method are summarized in the “Methodology” section. “Experiments” presents the application and results of the model to the prediction of dissolved oxygen in the actual industrial aquaculture, comparing it with other algorithms. Finally, the “Conclusion” section summarizes the findings of the work.

Methodology

This section introduces the basic principles of LSTM and TCN, and shows the framework and overview of the proposed model. Next, the combined algorithm of LSTM-TCN is introduced in detail.

Long short-term memory

Recurrent neural network (RNN) has a network structure in which the nodes of the hidden layer are connected with each other across time steps. Therefore, the network can generate a memory state of historical data, establish the dependence relationship between the input data at different times, and use it in the calculation of the current output. Based on this feature, RNN is suitable for mining time series information (Hu et al. 2019a, b). The network structure is shown in Fig. 1, illustrating how RNN is unrolled along the time series. The RNN network architecture includes only three parts, namely input layer, hidden layer, and output layer. For a given time series \(x=\left({x}_{1},{x}_{2},\dots ,{x}_{n}\right)\), the expected output series \(y=\left({y}_{1},{y}_{2},\dots ,{y}_{n}\right)\) can be obtained by calculation. At time t, the calculation formulas are as follows.

Fig. 1 Network structure diagram of RNN unit

$${h}^{t}=f\left({W}_{xh}{x}^{t}+{W}_{hh}{h}^{t-1}+{d}_{h}\right)$$
(1)
$${y}^{t}={W}_{hy}{h}^{t}+{d}_{y}$$
(2)

Here \({W}_{xh}\) is the weight coefficient matrix from the input layer to the hidden layer; \({W}_{hh}\) is the weight coefficient matrix for the hidden layer; \({W}_{hy}\) is the weight coefficient matrix from the hidden layer to the output layer; \({d}_{h}\) and \({d}_{y}\) are the bias vectors of hidden layer and output layer respectively; \(f\) is the nonlinear activation function.
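As an illustration of Eqs. (1) and (2), the recurrence can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in the experiments: the choice of tanh for \(f\), the zero initial hidden state, the layer sizes, and the random weights are all assumptions made for the example.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, W_hy, d_h, d_y, f=np.tanh):
    """Apply Eqs. (1)-(2) step by step; x_seq has shape (n, input_dim)."""
    h = np.zeros(W_hh.shape[0])              # assumed zero initial hidden state
    outputs = []
    for x_t in x_seq:
        h = f(W_xh @ x_t + W_hh @ h + d_h)   # Eq. (1): new hidden state
        outputs.append(W_hy @ h + d_y)       # Eq. (2): output at this step
    return np.array(outputs), h

# Hypothetical sizes: 5 time steps, 6 input variables, 4 hidden units, 1 output.
rng = np.random.default_rng(0)
n, input_dim, hidden_dim, output_dim = 5, 6, 4, 1
x_seq = rng.normal(size=(n, input_dim))
W_xh = 0.1 * rng.normal(size=(hidden_dim, input_dim))
W_hh = 0.1 * rng.normal(size=(hidden_dim, hidden_dim))
W_hy = 0.1 * rng.normal(size=(output_dim, hidden_dim))
y_seq, h_last = rnn_forward(x_seq, W_xh, W_hh, W_hy,
                            np.zeros(hidden_dim), np.zeros(output_dim))
print(y_seq.shape)  # (5, 1)
```

The same hidden state `h` is carried from one step to the next, which is how the network retains information about earlier inputs.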

As the input time series grows longer, RNN suffers from a long-term dependence problem and gradually forgets input information that appeared a long time ago. During gradient back-propagation, gradients may explode or vanish in the training process. To solve this problem, the LSTM algorithm was first proposed by Hochreiter and Schmidhuber (1997). As an improved RNN, it adds a gating mechanism to alleviate the vanishing gradient phenomenon. Compared with the RNN algorithm, LSTM has better performance on longer time series. The input of the current hidden layer consists of the output of the previous hidden layer and the input layer at the current time.

Figure 2 shows the structure of LSTM hidden layer neurons, illustrating the calculation process of current hidden layer neurons. Each hidden layer neuron contains input gate, forget gate, output gate, and current state.

Fig. 2 Structure diagram of LSTM hidden layer neurons

Actually, the input gate determines how much input of the current time network is saved into the unit state, i.e., how much information is updated into the neuron. The calculation formula is as follows:

$${i}_{t}=\sigma \left({W}_{i}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{i}\right)$$
(3)

The forget gate determines which information is discarded from the cell state, i.e., how much information of the last cell state \({c}_{t-1}\) is retained in the current cell state \({c}_{t}\). The calculation formula is as follows:

$${f}_{t}=\sigma \left({W}_{f}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{f}\right)$$
(4)

The output gate controls the output at the current time \(c_{{\text{t}}}\) and determines the final output information. The calculation formula is as follows:

$$\begin{array}{c}{o}_{t}=\sigma \left({W}_{o}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{o}\right)\\ {h}_{t}={o}_{t}*\mathrm{tanh}\left({c}_{t}\right)\end{array}$$
(5)

The calculation formula of cell state \(c_{{\text{t}}}\) is the following:

$$\begin{array}{c}{\tilde{c}}_{t}=\mathrm{tanh}\left({W}_{c}\bullet \left[{h}_{t-1},{x}_{t}\right]+{b}_{c}\right)\\ {c}_{t}={f}_{t}*{c}_{t-1}+{i}_{t}*{\tilde{c}}_{t}\end{array}$$
(6)

Here, \(\sigma\) is the sigmoid activation function; \(W_{{\text{f}}}\), \(W_{{\text{i}}}\), \(W_{{\text{c}}}\), and \(W_{{\text{o}}}\) are the weight matrices of the forget gate, input gate, candidate cell state, and output gate, respectively, while \(b_{{\text{f}}}\), \(b_{{\text{i}}}\), \(b_{{\text{c}}}\), and \(b_{{\text{o}}}\) are the corresponding bias vectors.
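To make the gate equations concrete, one LSTM time step following Eqs. (3) to (6) can be sketched as below. This is an illustrative NumPy sketch under assumptions, not the code used in the experiments: the dictionary layout of the weights, the layer sizes, and the random initial values are hypothetical, and the products between gates and states are taken element-wise.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b are dicts keyed by 'i', 'f', 'c', 'o'."""
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    i_t = sigmoid(W["i"] @ z + b["i"])       # Eq. (3): input gate
    f_t = sigmoid(W["f"] @ z + b["f"])       # Eq. (4): forget gate
    o_t = sigmoid(W["o"] @ z + b["o"])       # Eq. (5): output gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])   # Eq. (6): candidate state
    c_t = f_t * c_prev + i_t * c_tilde       # Eq. (6): element-wise update
    h_t = o_t * np.tanh(c_t)                 # Eq. (5): hidden output
    return h_t, c_t

# Hypothetical sizes: 6 input variables, 4 hidden units.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 6, 4
W = {g: 0.1 * rng.normal(size=(hidden_dim, hidden_dim + input_dim)) for g in "ifco"}
b = {g: np.zeros(hidden_dim) for g in "ifco"}
h_t, c_t = lstm_step(rng.normal(size=input_dim),
                     np.zeros(hidden_dim), np.zeros(hidden_dim), W, b)
print(h_t.shape, c_t.shape)  # (4,) (4,)
```

When the forget gate saturates near 1 and the input gate near 0, the cell state is carried over almost unchanged, which is the mechanism that lets the network keep information over long spans.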

The memory unit structure is added to the LSTM to forget and remember historical data by combining with the gating unit. It can effectively deal with the complex long-term time series dynamic dependence, so it is suitable for the prediction of dissolved oxygen in aquaculture.

Temporal convolutional network

As a variant of the convolutional neural network (CNN), temporal convolutional network (TCN) performs better in time series prediction than the recurrent neural network (RNN) and long short-term memory (LSTM) network. It is more suitable for processing sequential data with large receptive fields and temporality (Yan et al. 2020), as it employs causal convolutions and dilations.

The model of temporal convolutional network (TCN) (Bai et al. 2018) adopts convolution form. Its network architecture is mainly composed of fully convolutional network (FCN), causal convolution, dilated convolutions, and residual block. Corresponding to each module, the network has the following characteristics:

  1. The fully convolutional network can guarantee the same output and input sequence length;

  2. The network architecture of causal convolution makes no information “leakage” from future to past;

  3. The dilated convolutions and residual layer modules are used to build a deep network model, and accordingly, more historical information can be obtained for prediction.

Figure 3 is a comparison of standard convolution (left) and causal convolution (right). Different from the traditional convolutional neural network, for the prediction at time t, only the observed sequence \(\left({x}_{1},{x}_{2},\dots ,{x}_{t-1},{x}_{t}\right)\) is used in TCN instead of future information.

Fig. 3 Standard convolution and causal convolution with kernel size 3. a For standard convolution, the kernel size and stride are 3, and the kernel number is 2. b For causal convolution, two standard convolution layers are composed. Each figure is described in b

Figure 4 adds dilated convolution to the causal convolution, giving the dilated causal convolution used in TCN, here with dilation factors d = 1, 2, 4 and filter size k = 3. The use of dilated convolution makes it necessary to use padding in the TCN network so that the output of each layer has the same length as its input; the padding size is (k − 1) × d. Causal convolution ensures that the prediction at time t does not use future information, because the output at time t is obtained only by convolving the elements at time t and earlier in the previous layer. For sequence tasks that need a long history, dilated convolution is used to increase the receptive field of the network. There are two ways to increase the receptive field:

  1. To increase the dilation factor d;

  2. To select larger filter sizes k.
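The effect of both options can be checked with a short calculation. The sketch below assumes a stack with one dilated causal convolution per layer, in which case each layer adds (k − 1) × d past steps to the receptive field; the function names are hypothetical and chosen for this example only.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field of stacked dilated causal convolutions,
    assuming one convolution per layer."""
    return 1 + (kernel_size - 1) * sum(dilations)

def padding_sizes(kernel_size, dilations):
    """Left padding (k - 1) * d needed in each layer to keep the length."""
    return [(kernel_size - 1) * d for d in dilations]

print(receptive_field(3, [1, 2, 4]))      # 15
print(receptive_field(3, [1, 2, 4, 8]))   # 31  (larger dilation factor)
print(receptive_field(5, [1, 2, 4]))      # 29  (larger filter size)
print(padding_sizes(3, [1, 2, 4]))        # [2, 4, 8]
```

With k = 3 and d = 1, 2, 4, the last layer sees 15 time steps; doubling the dilation in one additional layer roughly doubles that span, while the number of weights grows only linearly with depth.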

Fig. 4 Dilated causal convolution with dilation factors 1, 2, and 3 for three convolution layers

The calculation formula is as follows:

$$F\left(s\right)=\left(x{*}_{d}f\right)\left(s\right)=\sum_{i=0}^{k-1}f\left(i\right)\cdot {x}_{s-d\cdot i}$$
(7)

where \(x\in {R}^{n}\) is the one-dimensional sequence input, \(f:\left\{0,\dots ,k-1\right\}\to R\) is the filter, \(s\) is the element in the sequence, \(d\) is the dilation factor, \(k\) is the filter size, and \(F\) is the dilated convolution operation.
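Equation (7) can be evaluated directly for a one-dimensional sequence. The following NumPy sketch is for illustration only and assumes that elements before the start of the sequence are treated as zeros, which corresponds to left padding of size (k − 1) × d; the input values and the filter are made up for the example.

```python
import numpy as np

def dilated_causal_conv1d(x, f, d=1):
    """Evaluate F(s) = sum_i f(i) * x[s - d*i], with x[j] = 0 for j < 0."""
    k = len(f)
    pad = (k - 1) * d
    x_padded = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])
    out = np.zeros(len(x))
    for s in range(len(x)):
        for i in range(k):
            # index pad + s in x_padded corresponds to x[s]
            out[s] += f[i] * x_padded[pad + s - d * i]
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
f = np.array([1.0, 1.0, 1.0])                # hypothetical filter, k = 3
out_d1 = dilated_causal_conv1d(x, f, d=1)    # values: 1, 3, 6, 9, 12, 15
out_d2 = dilated_causal_conv1d(x, f, d=2)    # values: 1, 2, 4, 6, 9, 12
print(out_d1, out_d2)
```

The output has the same length as the input, and each output element depends only on the current and earlier inputs, so changing a later input leaves all earlier outputs unchanged.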

As the length of the time series increases, TCN needs a larger receptive field to capture long time series. Increasing the network depth, however, leads to network degradation, higher consumption of computing resources, and exploding or vanishing gradients. The solution to this problem is to add residual modules to the network. The residual network consists of a series of residual blocks, each of which consists of two parts: a direct mapping part and a residual part. Adding skip connections to realize residual learning, with the identity map as the shortcut connection, reduces the complexity of the residual network and makes the deep network easier to train and optimize. It can be expressed as follows:

$${x}_{l+1}=h\left({x}_{l}\right)+F\left({x}_{l},{W}_{l}\right)$$
(8)

where \(F\left({x}_{l},{W}_{l}\right)\) is the residual part, and \({W}_{l}\) is the weight matrix. \(h\left({x}_{l}\right)\) is the direct mapping part, including a \({1} \times {1}\) convolution operation. \({x}_{l}\) is the input of the residual block, and \({x}_{l+1}\) is the output of the residual block.

Figure 5 shows the residual block structure in TCN. The residual block has two layers, each including dilated causal convolution, weight normalization, non-linearity (ReLU), and a dropout layer for regularization. If the input and output of the residual block have different dimensions, this is usually solved by adding a \({1} \times {1}\) convolution. The residual network adds the identity map of the cross-layer connection. Instead of modifying the entire transformation, it allows the layer to modify the identity mapping. This helps TCN to build a deeper network structure and increases the stability of the TCN network.
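The structure of the residual block and of Eq. (8) can be sketched as follows. This is a simplified forward pass under stated assumptions, not the trained model: weight normalization and dropout are omitted, the sum is returned without a further activation so that it matches Eq. (8) literally, and all shapes, names, and weights are hypothetical.

```python
import numpy as np

def causal_conv(x, W, d):
    """x: (T, C_in), W: (k, C_in, C_out). Left zero padding of (k - 1) * d
    keeps the output length equal to T."""
    k, T = W.shape[0], x.shape[0]
    pad = (k - 1) * d
    xp = np.vstack([np.zeros((pad, x.shape[1])), x])
    out = np.zeros((T, W.shape[2]))
    for i in range(k):
        start = pad - d * i                  # tap i looks d*i steps back
        out += xp[start:start + T] @ W[i]
    return out

def residual_block(x, W1, W2, d, W_skip=None):
    """x_{l+1} = h(x_l) + F(x_l, W_l), with two conv + ReLU layers as F."""
    relu = lambda z: np.maximum(z, 0.0)
    F = relu(causal_conv(relu(causal_conv(x, W1, d)), W2, d))
    h = x if W_skip is None else x @ W_skip  # 1x1 conv when channels differ
    return h + F

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))                  # 8 time steps, 3 channels
W1 = 0.1 * rng.normal(size=(3, 3, 5))
W2 = 0.1 * rng.normal(size=(3, 5, 5))
W_skip = 0.1 * rng.normal(size=(3, 5))       # maps 3 channels to 5
out = residual_block(x, W1, W2, d=2, W_skip=W_skip)
print(out.shape)  # (8, 5)
```

If the convolution weights are zero, the block reduces to the direct mapping alone, which illustrates why stacking such blocks does not make the network harder to optimize than a shallower one.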

Fig. 5 The structure of TCN residual block

The combined model (LSTM-TCN)

Figure 6 shows the main network architecture of our proposed LSTM-TCN prediction model. After data preprocessing, the multivariable water quality data first enter the LSTM network through the input layer to extract the correlation of the datasets in the time dimension. Then, the time series prediction model is established by combining it with the TCN network. Finally, the output goes through the fully connected layer to give the predicted future value of dissolved oxygen. The long short-term memory (LSTM) network solves the long-term dependence problem of the recurrent neural network, can capture both short-term and long-term dependence, and can be widely applied to time series problems. As a new algorithm for time series prediction, the temporal convolutional network (TCN) has been shown to be better than RNN on some datasets. It can deal with large-scale data and has a larger receptive field, which enables it to retain information from further in the past. By comparison, it uses less memory when processing a sequence of the same length. Therefore, based on the unique advantages of the two algorithms, we propose a fusion prediction model, LSTM-TCN, that combines them.

Fig. 6 Network architecture of LSTM-TCN

Experiments

Dataset and data preprocessing

The dataset used in this paper is real water quality data collected from an industrial recirculating aquaculture workshop by using a multi-parameter water quality sensor. The sampling period is from January 1, 2019, to July 9, 2019, and the data are collected every 10 min. The collected water quality data include six parameters: dissolved oxygen, water temperature, pH, turbidity, ammonia nitrogen, and water level. The six water quality parameters are used as input data to predict the future dissolved oxygen value from the historical water quality data. The dataset contains 27,077 samples: 24,370 samples are used for training and the remaining samples for testing. The validation set accounts for 10% of the training set.
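The sample counts implied by this split can be worked out as follows. The sketch assumes a chronological split in which the validation samples are taken from the end of the training portion; the text does not state how the validation samples were selected, so this ordering and the function name are assumptions.

```python
def chronological_split(n_total, n_train, val_fraction=0.10):
    """Return index ranges for fitting, validation, and testing.
    Assumes the validation part is the last 10% of the training samples."""
    n_val = round(n_train * val_fraction)
    fit_idx = range(0, n_train - n_val)
    val_idx = range(n_train - n_val, n_train)
    test_idx = range(n_train, n_total)
    return fit_idx, val_idx, test_idx

fit_idx, val_idx, test_idx = chronological_split(27_077, 24_370)
print(len(fit_idx), len(val_idx), len(test_idx))  # 21933 2437 2707
```

Under these assumptions, 21,933 samples are used to fit the model, 2,437 for validation, and 2,707 for testing.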

Due to the inevitable influence of the sensor’s behavior and the complex external environment, there may be missing data and abnormal values. For abnormal data, mean imputation is used in the experiment. For missing data, we use the temporal regularized matrix factorization (TRMF) method to predict and repair the time series data, considering the time dependence. The respective formula is as follows:

$${x}_{t}\approx \sum_{l\in L}{\theta }_{l}\otimes {x}_{t-l}$$
(9)

where L is a lag set that stores the associated distance between columns in X; \(\theta_{{\text{l}}}\) is the time series coefficient vector of \(x_{{\text{t - l}}}\).

Evaluation criteria

In the experiment, four evaluation indexes are used to evaluate the prediction results of different models: mean absolute error (MAE), mean absolute percentage error (MAPE), root mean square error (RMSE), and coefficient of determination (R2). The smaller the values of the first three indexes, the better the prediction, while the closer the R2 value is to 1, the more accurate the prediction of the model. Their respective expressions are as follows:

$$\mathrm{MAE}=\frac{1}{m}\sum_{i=1}^{m}\left|{y}_{i}-{\widehat{y}}_{i}\right|$$
(10)
$$\mathrm{MAPE}=\frac{1}{m}\sum_{i=1}^{m}\left|\frac{{y}_{i}-{\widehat{y}}_{i}}{{y}_{i}}\right|\times 100\%$$
(11)
$$\mathrm{RMSE}=\sqrt{\frac{1}{m}\sum_{i=1}^{m}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}$$
(12)
$${R}^{2}=1-\frac{\sum_{i}{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}}{\sum_{i}{\left({y}_{i}-\overline{y}\right)}^{2}}$$
(13)

where \(y_{{\text{i}}}\) is the actual value observed, \(\hat{y}_{{\text{i}}}\) is the predicted value, m is the number of test data, and \(\overline{y}\) is the mean value of the original observation data.
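For reference, the four indexes in Eqs. (10) to (13) can be computed as in the NumPy sketch below. The observed and predicted values are made-up numbers for illustration, not data from the experiments, and MAPE is returned as a fraction (multiply by 100 to express it in percent).

```python
import numpy as np

def mae(y, y_hat):
    return np.mean(np.abs(y - y_hat))                 # Eq. (10)

def mape(y, y_hat):
    return np.mean(np.abs((y - y_hat) / y))           # Eq. (11), as a fraction

def rmse(y, y_hat):
    return np.sqrt(np.mean((y - y_hat) ** 2))         # Eq. (12)

def r2(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot                      # Eq. (13)

# Hypothetical dissolved oxygen values (mg/L) for illustration only.
y = np.array([6.0, 7.0, 8.0, 9.0])
y_hat = np.array([6.5, 7.0, 7.5, 9.0])
print(mae(y, y_hat))    # 0.25
print(rmse(y, y_hat))   # about 0.3536
print(mape(y, y_hat))   # about 0.0365
print(r2(y, y_hat))     # 0.9
```

Note that MAPE divides by the observed value, so it is undefined when an observation equals zero.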

Experimental details

Prediction results of the combined model LSTM-TCN

In order to verify the better performance of the combined LSTM-TCN model for the prediction of dissolved oxygen, experiments with LSTM, TCN single model forecast, and CNN-LSTM combined model forecast were also performed. On the basis of these results, contrast analysis was conducted. In each algorithm, the historical time window was set to 2 to predict 20 min in advance. Figures 7, 8, 9 and 10 are the prediction results of models LSTM, TCN, CNN-LSTM, and LSTM-TCN, respectively. Among them, the blue curve represents the real value, while the orange curve represents the predicted value of dissolved oxygen.
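The construction of supervised samples from the time series can be sketched as follows. This is an assumed reading of the setting above, not the code used in the experiments: with data sampled every 10 min, a historical time window of 2 means two consecutive records as input, and predicting 20 min in advance is taken to mean that the target is the dissolved oxygen value two steps after the last record in the window. The function name and the toy array are hypothetical.

```python
import numpy as np

def make_windows(data, target_col, window=2, horizon=2):
    """data: (n_samples, n_features). Returns X with shape
    (m, window, n_features) and y with shape (m,)."""
    X, y = [], []
    last_start = len(data) - window - horizon
    for start in range(last_start + 1):
        end = start + window                           # exclusive end of inputs
        X.append(data[start:end])
        y.append(data[end + horizon - 1, target_col])  # value 'horizon' steps ahead
    return np.array(X), np.array(y)

# Toy array: 10 records of 6 variables; column 0 plays the role of dissolved oxygen.
data = np.arange(60, dtype=float).reshape(10, 6)
X, y = make_windows(data, target_col=0, window=2, horizon=2)
print(X.shape, y.shape)   # (7, 2, 6) (7,)
print(y[0])               # 18.0, the column-0 value of record 3
```

The first sample uses records 0 and 1 as input and record 3 as target, so the target lies 20 min after the last input record under the assumed 10 min sampling interval.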

Fig. 7 Prediction results of the LSTM model

Fig. 8 Prediction results of the TCN model

Fig. 9 Prediction results of the CNN-LSTM model

Fig. 10 Prediction results of the LSTM-TCN model

For the prediction of dissolved oxygen, the curve of LSTM-TCN algorithm is more consistent with the actual data, and the prediction result is more accurate, as it can be seen in Figs. 7, 8, 9 and 10.

Table 1 shows the comparison of error results of different prediction models with four evaluation indexes. The pattern of the predicted result curves and evaluation index values can be summarized as follows:

Table 1 The result value of evaluation indexes of different algorithms

The combined algorithm LSTM-TCN predicts better than LSTM, TCN, and CNN-LSTM. The final evaluation index values of LSTM-TCN are as follows: MAE is 0.236, which improves the prediction accuracy by 36% compared with LSTM; MAPE is 0.031, a decrease of 93% in the error value compared with CNN-LSTM; RMSE is 0.342, which improves the prediction accuracy by 33% compared with LSTM; R2 is 0.94, which is slightly lower than the value of 0.97 reported for the compared algorithms. In conclusion, the combined algorithm of LSTM-TCN has better prediction and improves the prediction accuracy of dissolved oxygen to a certain extent.

Experiment with different lookback window sizes

To study how the temporal convolutional network can be used for long historical data, we carried out experiments with different historical time windows for the prediction of dissolved oxygen. We set the time window to 2, 4, 8, 16, 32, and 64, with all other conditions unchanged. Table 2 shows the error results under different time window sizes.

Table 2 Evaluation index results of dissolved oxygen prediction for different lookback window size

It can be seen from Table 2 that as the time window increases, the error evaluation index values of dissolved oxygen fluctuate up and down. The prediction results are better when the time windows are 2 and 16. The experimental results show that the TCN network has the ability to use more historical data, and the prediction model with the TCN network added can also predict well for larger historical time windows.

Attention mechanism

In 2014, attention model was applied to machine translation as part of the RNN framework (Bahdanau et al. 2014). With the development of deep learning, it has become an important concept in the field of neural networks, and it is widely used in image classification, speech, and machine translation (Ran et al. 2019). In recent years, more and more attention mechanism modules have been introduced into the prediction of time series. By giving weight values for different time and space data, the efficiency and accuracy of prediction may be improved.

In this section, the attention mechanism experiments are explained. The attention mechanism layer is added between the fusion model network layer and the fully connected layer. The aim is to explore whether the attention mechanism has an effect on the prediction of dissolved oxygen for specific scenarios and data. The fusion models CNN-LSTM and LSTM-TCN are each tested with and without the attention mechanism, and the predicted results are analyzed and compared. Figure 11 shows the comparison of prediction results of the CNN-LSTM combined model with and without attention mechanism. Figure 12 shows the comparison of prediction results of the LSTM-TCN combined model with and without attention mechanism.
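One common way to place an attention layer between a sequence model and a fully connected layer is to weight the features of each time step and sum them into a single vector. The sketch below illustrates this idea with an additive scoring function; the exact form of attention used in the experiments is not specified in the text, so the scoring function, the parameter names `W`, `b`, `v`, and all sizes are assumptions made for illustration.

```python
import numpy as np

def attention_pool(H, W, b, v):
    """H: (T, D) features for T time steps from the fusion model.
    Returns the weighted context vector (D,) and the weights (T,)."""
    scores = np.tanh(H @ W + b) @ v            # one score per time step
    scores = scores - scores.max()             # for numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # softmax over time
    context = weights @ H                      # weighted sum of the features
    return context, weights

# Hypothetical sizes: 16 time steps, 8 features, 4 attention units.
rng = np.random.default_rng(0)
T, D, A = 16, 8, 4
H = rng.normal(size=(T, D))
context, weights = attention_pool(H, rng.normal(size=(D, A)),
                                  np.zeros(A), rng.normal(size=A))
print(context.shape, weights.sum())  # (8,) and a sum of 1 up to rounding
```

The context vector would then be passed to the fully connected layer. When all scores are equal, the weights are uniform and the layer reduces to averaging over time.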

Fig. 11 Comparison of prediction results of CNN-LSTM with and without attention mechanism. a CNN-LSTM without attention mechanism. b CNN-LSTM with attention mechanism

Fig. 12 Comparison of prediction results of LSTM-TCN with and without attention mechanism. a LSTM-TCN without attention mechanism. b LSTM-TCN with attention mechanism

Table 3 shows the error evaluation index results of CNN-LSTM and LSTM-TCN algorithms with and without attention mechanism. It can be seen from Table 3 that after adding attention mechanism to the CNN-LSTM fusion algorithm, the error evaluation indexes MAE and MAPE decreased. For the LSTM-TCN fusion algorithm, the addition of attention mechanism does not improve the final prediction accuracy.

Table 3 The influence of attention mechanism on prediction results

Conclusion

In this work, we propose a fusion prediction model of LSTM and TCN to predict dissolved oxygen in multivariable aquaculture environment. LSTM is used to extract the sequence features in time dimension, dealing with the long-term dependence in complex time series. We built the fusion prediction model of time series with the TCN network. In addition, we also studied the influence of the size of historical time window, as well as the effect of attention mechanism on the prediction. The experiments show that the fusion model still has a good prediction performance when the time window is large. The addition of attention mechanism improves the prediction of the combined model CNN-LSTM, but does not improve the combined model LSTM-TCN. Finally, the performance of the LSTM-TCN prediction model was verified with the dissolved oxygen data in real aquaculture by comparing with the LSTM, TCN, and CNN-LSTM models. The model established in this paper has better prediction performance and higher prediction accuracy, and it can be applied to practical aquaculture, with error measures of MAE = 0.236, MAPE = 0.031, RMSE = 0.342, and R2 = 0.94. This study considers only the correlation between water quality data characteristics in time. In the future work, we will combine the water quality environmental data with the behavior data of aquaculture species, as well as consider the prediction of spatial characteristics of water quality data.