Introduction

The issue of water resources is one of the most important concerns in the world today. Surface water, also called land water, is a water resource that includes rivers, lakes, and swamps. Most of the water utilized by humans comes from surface water; for example, the tap water for the daily use of every household is essentially surface water processed by water plants. Although China has a top-ranking amount of water resources, the per capita share is small due to its large area and population (Sun et al. 2017). Moreover, human-induced pollution has caused severe water shortages. Water pollution compounds the insufficiency of surface water resources and seriously hinders the coordinated development of regional ecological civilization and economic construction (Yang and Chen 2020). Therefore, it is of great significance to carry out high-precision advance prediction of water quality, so as to provide a scientific basis for relevant management measures.

At present, water quality prediction methods fall mainly into two types. The first comprises traditional prediction methods grounded in classical mathematics, including the time series prediction method, regression analysis, the Markov model (Li and Song 2015), and the water quality simulation prediction method; the second comprises artificial intelligence-based prediction methods, including the gray model, the artificial neural network prediction method (Kim and Seo 2015; Chatterjee et al. 2017), and the support vector machine regression prediction method. In recent years, with the rapid development of machine learning, artificial neural networks, with their advantage of nonlinear mapping, have become one of the most widely used methods for water quality prediction (Azad et al. 2019; Kadkhodazadeh and Farzin 2021). However, the traditional artificial neural network has shortcomings such as poor generalization ability, overfitting, and a tendency to become trapped in local minima (Ding et al. 2011), which limit its application. ELM is a single hidden layer feedforward neural network proposed by Huang et al. (2006). ELM overcomes a host of shortcomings of traditional feedforward neural network algorithms and has gradually become a powerful tool in the field of water quality prediction owing to its fast learning speed and strong generalization ability (Heddam and Kisi 2017; Fijani et al. 2019).

With the change of water quality data acquisition from manual to automated collection, the amount of water quality data accumulated over time has become very large and contains more information. Traditional neural networks, including ELM, no longer stand out when learning from large data volumes, whereas deep learning, which originates from artificial neural networks, is more advantageous in processing them (Jin et al. 2020). In addition, water quality information exists in the form of multivariate time series data sets, so the temporal information should be considered in water quality prediction. To achieve efficient processing of time series, Huang et al. (2013) pioneered the DELM, also called the multilayer extreme learning machine (ML-ELM). DELM is a deep learning-based pattern recognition method formed by extending ELM: it first employs multiple extreme learning machine-autoencoders (ELM-AE) for unsupervised pre-training and then uses the output weights of each ELM-AE to initialize the entire DELM (Zhang et al. 2016).

Compared with other deep methods, DELM has the advantage of fast training. However, in the pre-training process of each ELM-AE, the input layer weights and biases are randomly generated orthogonal random matrices; moreover, although ELM-AE uses least squares to update parameters during unsupervised pre-training, only the output layer weights are updated, while the input layer weights and biases are not. As a result, the performance of DELM is affected by the random input weights and biases of each ELM-AE (Pacheco et al. 2018). To overcome this drawback, scholars at home and abroad mostly use metaheuristic algorithms to optimize the target parameters (Azad et al. 2018; Farzin et al. 2020), including cat swarm optimization (Chu et al. 2006), the fruit fly optimization algorithm (Pan 2012), chicken swarm optimization (Meng et al. 2014), and the dragonfly algorithm (Meraihi et al. 2020). SSA, first proposed by Xue and Shen (2020), is a new type of swarm intelligence optimization algorithm. Compared with other algorithms, it is characterized by a high solution rate and high accuracy; however, so far, there has been no research on the SSA-based optimal selection of DELM hyperparameters and its application in water quality prediction.

The presence of abnormal events in the natural environment can cause sudden changes in water quality, and the resulting noisy data can distort the data distribution of the water quality series, which greatly affects model accuracy. To reduce the impact of noisy data, the original time series needs noise reduction processing to remove the noise and reduce its interference with the prediction model. Commonly used noise reduction methods include Savitzky-Golay (SG) filtering (Bi et al. 2021), wavelet noise reduction (He et al. 2008), Kalman filtering (Rajakumar et al. 2019), and singular value decomposition (Sun et al. 2021). Among them, the denoising effect of the wavelet method depends on the choice of base wavelet and the threshold selection method; the Kalman filter performs poorly on nonlinear data and considers only Gaussian white noise, so its denoising deteriorates when the interfering environmental noise is unknown. Singular value decomposition reconstructs the signal by selecting a suitable singular value threshold to suppress noise, but an inappropriate threshold easily causes signal distortion. The SWT solves the frequency mixing problem by compressing the wavelet-transformed time–frequency map in the frequency direction, while filtering out small-amplitude wavelet coefficients during the squeezing according to a set threshold, which gives it strong robustness to noise. At present, there are few studies on the application of SWT to water quality data denoising, and its denoising effect on water quality series is worthy of further study (Xue et al. 2018).

To sum up, the goals of this paper are: (1) to denoise the original data using SWT and compare its performance with peer denoising algorithms; (2) to construct the SSA-DELM model and determine the optimal hyperparameters; and (3) to integrate the SWT-SSA-DELM model and verify its effectiveness with examples.

According to the literature, few studies have applied DELM to water quality parameter prediction; therefore, the prediction performance of the improved DELM is worth exploring and analyzing. Compared with existing models, our method makes the following major contributions to the literature: (1) a new data cleansing technique is established for water quality time series; (2) a novel optimal-hybrid model combining SSA and DELM with high accuracy is proposed. The rest of the paper is organized as follows. In the section ‘Materials and methods,’ the SWT, DELM, and SSA are introduced and the SWT-SSA-DELM hybrid model is proposed. The ‘Results’ and ‘Discussion’ sections contain the results and discussion of SWT-SSA-DELM in practical application. Section ‘Conclusion’ concludes the study and proposes future research plans.

Materials and methods

Data description

The Yellow River Basin is located between 32°N–42°N and 96°E–119°E, with a length of about 1900 km from east to west and a width of about 1100 km from north to south (Fig. 1). The Yellow River, the mother river of the Chinese nation, has a total length of 5464 km and is the second longest river in China. It originates from the Bayan Har Mountains on the Qinghai-Tibet Plateau and flows through nine provinces (regions), namely Qinghai, Sichuan, Gansu, Ningxia, Inner Mongolia, Shaanxi, Shanxi, Henan, and Shandong, covering a total area of 7.95 × 10⁵ km².

Fig. 1
figure 1

Gauging stations in the Yellow River Basin

The data sets used were collected at the Xinchengqiao and Xiaolangdi gauging stations from 2010 to 2016 at weekly intervals, spanning 365 weeks. Among the monitored indices, dissolved oxygen (DO) was selected as the prediction variable.

The monitoring data of DO at Xinchengqiao and Xiaolangdi are depicted in Fig. 2. The statistical analysis results (Table 1) of the original time series of DO concentration show that the data have a wide range of variation, and the standard deviation (SD) implies a high deviation of data from the mean value.

Fig. 2
figure 2

The recorded data at Xinchengqiao and Xiaolangdi

Table 1 Statistical characteristics of the data used

SWT

Daubechies et al. (2011) proposed the SWT method by combining the synchrosqueezing technique with the continuous wavelet transform (CWT); it improves the time–frequency resolution by compressing and rearranging the CWT coefficients in the frequency direction.

For any sequence \(s\left(t\right)\in {L}^{2}\left(R\right)\), its CWT is expressed as:

$$W_s(a,b)=\left\langle s,\psi_{a,b}\right\rangle=\int s(t)a^{-1/2}\overline{\psi\left(\frac{t-b}a\right)}dt$$
(1)

where Ws(a, b) is the wavelet transform coefficient spectrum, a and b are the scale and displacement variables, respectively, and ψ(·) is the wavelet basis function.

According to the Plancherel theorem, Ws(a, b) can be rewritten as the CWT of the signal s in the frequency domain:

$${W}_{s}\left(a,b\right)=\frac{1}{2\uppi }\int \widehat{s}\left(\xi \right){a}^{1/2}\overline{\widehat{\psi }\left(a\xi \right)}{\mathrm{e}}^{ib\xi }d\xi$$
(2)

where ξ is the angular frequency and \(\widehat{s}\left(\xi \right)\) is the Fourier transform of the signal s. When the wavelet transform coefficient of any point (a, b) is not equal to 0, the instantaneous frequency ωs(a, b) of signal s is:

$${\omega }_{s}\left(a,b\right)=-i{\left({W}_{s}\left(a,b\right)\right)}^{-1}\frac{\partial }{\partial b}{W}_{s}\left(a,b\right)$$
(3)

Then, the mapping from the point (b, a) to (b, ωs(a, b)) is established by the synchrosqueezing transformation, and Ws(a, b) is transformed from the time-scale plane to the time–frequency plane to obtain the new time–frequency spectrum.

In practical application, the frequency variable ω, the scale variable a, and the displacement variable b need to be discretized first. When the frequency ω and scale a are discrete variables, Ws(a, b) can be obtained only when \({a}_{k}-{a}_{k-1}={\left(\Delta a\right)}_{k}\) is satisfied at the point \({a}_{k}\), and the corresponding synchrosqueezed transformation Ts(a, b) can be calculated accurately only on the continuous interval \(\left[{\omega }_{\ell}-\frac{1}{2}\Delta \omega ,{\omega }_{\ell}+\frac{1}{2}\Delta \omega \right]\). Finally, the expression of the synchrosqueezed wavelet transform is obtained by combining these conditions:

$${T}_{s}\left({\omega }_{\ell},b\right)={\left(\Delta \omega \right)}^{-1}\sum_{{a}_{k}:\left|\omega \left({a}_{k},b\right)-{\omega }_{\ell}\right|\le \Delta \omega /2}{W}_{s}\left({a}_{k},b\right){a}_{k}^{-\frac{3}{2}}{\left(\Delta a\right)}_{k}$$
(4)

After synchrosqueezing, the signal can still be reconstructed:

$$\begin{aligned}\int_{0}^{\infty }{W}_{s}\left(a,b\right){a}^{-3/2}da&=\frac{1}{2\pi }\int_{-\infty }^{\infty }\int_{0}^{\infty }\widehat{s}\left(\xi \right)\overline{\widehat{\psi }\left(a\xi \right)}{e}^{ib\xi }{a}^{-1}da\,d\xi \\ &=\frac{1}{2\pi }\int_{0}^{\infty }\int_{0}^{\infty }\widehat{s}\left(\xi \right)\overline{\widehat{\psi }\left(a\xi \right)}{e}^{ib\xi }{a}^{-1}da\,d\xi \\ &=\int_{0}^{\infty }\overline{\widehat{\psi }\left(\xi \right)}\frac{d\xi }{\xi }\cdot \frac{1}{2\pi }\int_{0}^{\infty }\widehat{s}\left(\xi \right){e}^{ib\xi }d\xi \end{aligned}$$
(5)

The specific steps can be summarized as follows (Wang 2017):

  1. (1)

    Obtain the water quality sequence data, perform CWT on the sequence s, and obtain its wavelet transform coefficients Ws in the frequency domain;

  2. (2)

    When the wavelet transform coefficient is not 0, the corresponding instantaneous frequency ωs can be obtained;

  3. (3)

    The instantaneous frequency ωs is transformed by synchrosqueezing to obtain its new time–frequency spectrogram.
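To make these three steps concrete, the following is a minimal NumPy sketch of the procedure under stated assumptions: the analytic Morlet wavelet, the finite-difference phase transform, and the hard amplitude threshold `gamma` that discards small (noisy) coefficients are illustrative choices, not the exact configuration used in this paper.

```python
import numpy as np

def swt_squeeze(s, n_scales=64, n_freqs=64, gamma=1e-2, w0=6.0):
    """Sketch of steps (1)-(3): CWT, instantaneous frequency, synchrosqueezing."""
    n = len(s)
    xi = 2 * np.pi * np.fft.fftfreq(n)                 # angular frequency grid
    scales = 2.0 ** np.linspace(0, np.log2(n / 4), n_scales)
    s_hat = np.fft.fft(s)

    # Step (1): CWT evaluated in the frequency domain (Eq. 2), analytic Morlet wavelet
    W = np.empty((n_scales, n), dtype=complex)
    for k, a in enumerate(scales):
        psi_hat = np.exp(-0.5 * (a * xi - w0) ** 2) * (xi > 0)
        W[k] = np.fft.ifft(s_hat * np.sqrt(a) * np.conj(psi_hat))

    # Step (2): instantaneous frequency where |W| exceeds the threshold (Eq. 3);
    # the threshold is what suppresses the small-amplitude (noisy) coefficients
    dW = np.gradient(W, axis=1)
    mask = np.abs(W) > gamma
    omega = np.full(W.shape, np.nan)
    omega[mask] = np.abs(np.imag(dW[mask] / W[mask]))

    # Step (3): squeeze W from the (a, b) plane onto discrete frequency bins (Eq. 4)
    bins = np.linspace(np.nanmin(omega), np.nanmax(omega), n_freqs)
    d_omega = bins[1] - bins[0]
    da = np.gradient(scales)
    T = np.zeros((n_freqs, n), dtype=complex)
    for k, a in enumerate(scales):
        for b in range(n):
            if mask[k, b]:
                l = int(round((omega[k, b] - bins[0]) / d_omega))
                if 0 <= l < n_freqs:
                    T[l, b] += W[k, b] * a**-1.5 * da[k] / d_omega
    return T, bins
```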

ELM

Given P different data samples \(\left({x}_{j},{t}_{j}\right)\in {R}^{n}\times {R}^{m}\), for a single hidden layer feedforward neural network with N hidden layer nodes and an activation function of g(x), the output expression is as follows:

$$\sum_{i=1}^{N}{\beta }_{i}{g}_{i}\left({x}_{j}\right)=\sum_{i=1}^{N}{\beta }_{i}g\left({\alpha }_{i}{x}_{j}+{b}_{i}\right)={Y}_{j}$$
(6)

where bi represents the randomly generated hidden layer biases; αi denotes the weights connecting the input layer and the hidden layer; βi denotes the weights connecting the hidden layer and the output layer; and j = 1, 2, …, P. If g(x) is infinitely differentiable, then the true output values of the input samples can be approximated with zero error, which can be expressed as:

$$\sum_{i=1}^{N}{\beta }_{i}g\left({\alpha }_{i}{x}_{j}+{b}_{i}\right)={t}_{j},\quad j=1,2,\cdots ,P$$
(7)

and abbreviated as:

$$H\beta =T$$
(8)

where H is the hidden layer output matrix:

$$H={\left[\begin{array}{ccc}g\left({\alpha }_{1}{x}_{1}+{b}_{1}\right)& \cdots & g\left({\alpha }_{N}{x}_{1}+{b}_{N}\right)\\ \vdots & \ddots & \vdots \\ g\left({\alpha }_{1}{x}_{P}+{b}_{1}\right)& \cdots & g\left({\alpha }_{N}{x}_{P}+{b}_{N}\right)\end{array}\right]}_{P\times N}$$
(9)

The ELM algorithm steps are (Zounemat-Kermani et al. 2021):

  1. (1)

    The algorithm is given the training sample \(S=\left\{\left(x_j,t_j\right)\vert x_j\in R^n,t_j\in R^m,j=1,2,...,P\right\}\) and the hidden layer activation function g(x).

  2. (2)

    The number of hidden layer nodes N is set, and the algorithm randomly generates the input weights αi and biases bi.

  3. (3)

    The hidden layer output matrix H is calculated.

  4. (4)

    The output weight β is calculated as β = H*T, where H* represents the Moore–Penrose generalized inverse matrix of H.

In order to improve the generalization ability and stability of the algorithm, Huang et al. (2006) introduced a regularization term I/C into the solution for β, where I is the identity matrix and C a penalty coefficient. The formula is as follows:

$$\beta ={\left(I/C+{H}^{\mathrm{T}}H\right)}^{-1}{H}^{\mathrm{T}}T$$
(10)
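As a concrete reference, the four steps and the regularized solution (Eq. 10) can be sketched in a few lines of NumPy; the synthetic data and the choice C = 10³ are placeholders, not the settings used in this paper.

```python
import numpy as np

def elm_train(X, T, n_hidden=10, C=1e3, rng=np.random.default_rng(0)):
    """Steps (1)-(4): random hidden layer, then a regularized least-squares readout."""
    alpha = rng.uniform(-1, 1, (X.shape[1], n_hidden))   # step (2): random input weights
    b = rng.uniform(-1, 1, n_hidden)                     # step (2): random biases
    H = 1.0 / (1.0 + np.exp(-(X @ alpha + b)))           # step (3): hidden output matrix (Eq. 9)
    # step (4) in its regularized form (Eq. 10): beta = (I/C + H^T H)^{-1} H^T T
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return alpha, b, beta

def elm_predict(X, alpha, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ alpha + b)))
    return H @ beta

# illustrative usage on synthetic data
X = np.random.default_rng(1).uniform(size=(100, 9))     # e.g., 9 lagged DO values per sample
T = X.sum(axis=1, keepdims=True)                        # placeholder target
params = elm_train(X, T)
pred = elm_predict(X, *params)
```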

Hierarchical autoencoder

AE is a feedforward unsupervised neural network that uses the back-propagation algorithm to make the network output equal to the network input. An AE with a single hidden layer contains an encoder network and a decoder network, and its structure can be seen as a simple three-layer neural network: an input layer, a hidden layer, and an output layer, where the output layer has the same number of nodes as the input layer. Its purpose is to reconstruct the input signal through the encoding and decoding process; ideally, the output signal is identical to the input signal. The specific network structure is shown in Fig. 3.

Fig. 3
figure 3

Network structure of the AE

The encoding process of the AE can be expressed as:

$$h=f\left(x\right)={s}_{f}\left({w}_{1}x+{b}_{1}\right)$$
(11)

The decoding process of the AE can be expressed as:

$$y=g\left(h\right)={s}_{g}\left({w}_{2}h+{b}_{2}\right)$$
(12)

where w1, b1 refer to the weights and biases of the encoder, and w2, b2 refer to the weights and biases of the decoder.

The activation function sf of the encoder is a nonlinear transformation; sigmoid, tanh, and ReLU functions are usually used. The activation function sg of the decoder generally adopts a sigmoid function or a constant function. The network parameters of the autoencoder are thus θ = [w1, w2, b1, b2]. When the error between the input signal x and the reconstructed signal y is very small, the AE is considered well trained. The reconstruction error function L(x, y) represents the error between x and y.

If the activation function sg of the decoder is a constant function, then the reconstruction error function can be expressed as:

$$L\left(x,y\right)={\Vert x-y\Vert }^{2}$$
(13)

If the activation function sg of the decoder is a sigmoid function, then the reconstruction error function can be expressed as:

$$L\left(x,y\right)=-\sum_{i=1}^{n}\left[{x}_{i}log{y}_{i}+\left(1-{x}_{i}\right)log\left(1-{y}_{i}\right)\right]$$
(14)

Let the training sample data set be \(S={\left\{{X}^{\left(i\right)}\right\}}_{i=1}^{N}\), then the loss function of AE can be expressed as:

$${J}_{AE}\left(\theta \right)=\sum_{x\in S}L\left(x,g\left(f\left(x\right)\right)\right)$$
(15)

For ELM-AE, the input data equals the output data, which allows it to learn without supervision; the randomly generated weights α and biases b of the hidden layer nodes are orthogonal, which is the basis of ELM-AE. In addition, for ELM-AE, the input data is mapped to a high-dimensional space via the generated orthogonal hidden layer parameters, which, following the Johnson–Lindenstrauss lemma, gives:

$$H=G\left(\alpha \times x+b\right),\quad {\alpha }^{\mathrm{T}}\alpha =I,\;{b}^{\mathrm{T}}b=1$$
(16)

where α = [α1, α2,…, αN] is the orthogonal weight matrix connecting the input layer and the hidden layer nodes, and b = [b1, b2,…, bN] represents the orthogonal hidden layer node biases. The output weight β of ELM-AE performs a learning transformation from the feature space back to the input data and, analogously to Eq. (10), is obtained as:

$$\beta ={\left(I/C+{H}^{\mathrm{T}}H\right)}^{-1}{H}^{\mathrm{T}}x$$
(17)

DELM

DELM is a hierarchical neural network built by stacking ELM-AEs, and its architecture is presented in Fig. 4. DELM does not need fine-tuning; its hidden layer parameters are initialized by ELM-AE so that they are orthogonal. The output matrices of the last hidden layer and the output layer are calculated with the least squares method. Each layer’s output is then taken in turn as the input of the next layer, and the stopping criterion is the set precision or operation time (Tissera and McDonnell 2016).

Fig. 4
figure 4

Network structure of DELM (a) The first layer weight matrix (b) The i + 1 layer weight matrix (c) The last layer weight matrix (Kasun et al. 2013)

The main execution steps of DELM can be described as follows:

  1. (1)

    The algorithm is given the training sample \(S=\left\{\left(x_j,t_j\right)\vert x_j\in R^n,t_j\in R^m,j=1,2,...,P\right\}\) and the activation function of the hidden layer g(x).

  2. (2)

    The network structure of DELM is set so that each ELM-AE reconstructs its input (x = y); the hidden layers are pre-trained layer by layer with ELM-AE so that the parameters are orthogonal.

  3. (3)

    The hidden layer output matrix H is calculated.

  4. (4)

    The output weight β is calculated with Formula (17).

  5. (5)

    Steps (3) and (4) are repeated for 1 ≤ i ≤ y−1, where y is the number of hidden layers, and the output matrix of the ith hidden layer is calculated.

  6. (6)

    If i = y, the output matrix of the last hidden layer is determined by the least squares method.
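A self-contained sketch of these steps follows; the layer sizes and the penalty C are placeholders, and the QR-based orthogonalization is one simple way of satisfying Eq. (16), not necessarily the construction used in this paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def delm_train(X, T, hidden_sizes=(20, 20), C=1e3, rng=np.random.default_rng(0)):
    """Steps (1)-(6): layer-wise ELM-AE pre-training, then a least-squares readout."""
    H, layer_weights = X, []
    for n_hidden in hidden_sizes:
        # ELM-AE with orthogonal random parameters (Eq. 16); QR orthonormalizes
        # the columns (or the rows, when the layer is wider than its input)
        A = rng.standard_normal((H.shape[1], n_hidden))
        Q, _ = np.linalg.qr(A if H.shape[1] >= n_hidden else A.T)
        alpha = Q if H.shape[1] >= n_hidden else Q.T
        b = rng.standard_normal(n_hidden)
        b /= np.linalg.norm(b)
        G = sigmoid(H @ alpha + b)                 # step (3): ELM-AE hidden output
        # Eq. (17): output weights map the feature space back to the layer input
        beta = np.linalg.solve(np.eye(n_hidden) / C + G.T @ G, G.T @ H)
        layer_weights.append(beta.T)               # beta^T initializes this DELM layer
        H = sigmoid(H @ beta.T)                    # this layer's output feeds the next ELM-AE
    # steps (5)-(6): last hidden layer -> output by regularized least squares
    beta_out = np.linalg.solve(np.eye(H.shape[1]) / C + H.T @ H, H.T @ T)
    return layer_weights, beta_out

def delm_predict(X, layer_weights, beta_out):
    H = X
    for W in layer_weights:
        H = sigmoid(H @ W)
    return H @ beta_out
```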

SSA

In DELM, the selection of the input weights α and biases b of each ELM-AE used in pre-training significantly influences the stability and accuracy of DELM. Therefore, SSA is adopted to assist in optimizing the input weights and biases of DELM.

The sparrow search algorithm is inspired by the foraging and anti-predation behavior of sparrows (Xue and Shen 2020). The sparrow set matrix is as follows.

$$X={\left[{x}_{1},{x}_{2}\cdots {x}_{n}\right]}^{\mathrm{T}},{x}_{i}=\left[{x}_{i,1},{x}_{i,2}\cdots {x}_{i,d}\right]$$
(18)

where n is the number of sparrows, i = 1, 2, …, n, and d is the dimension of the variables.

The matrix of fitness values for sparrows is expressed as follows.

$${F}_{x}={\left[f\left({x}_{1}\right),f\left({x}_{2}\right)\cdots f\left({x}_{n}\right)\right]}^{\mathrm{T}}$$
(19)
$$f\left({x}_{i}\right)=\left[f\left({x}_{i,1}\right),f\left({x}_{i,2}\right)\cdots f\left({x}_{i,d}\right)\right]$$
(20)

where each value in Fx indicates the fitness value of the individual. Sparrows with better fitness values are given priority to obtain food and act as producers to lead the whole population towards the food source. The position of the producer is updated in the following way (Mao et al. 2021).

$${X}_{i,j}^{t+1}=\begin{cases}{X}_{i,j}^{t}\cdot \mathrm{exp}\left(\frac{-i}{\alpha \cdot {\mathrm{iter}}_{\mathrm{max}}}\right), & {R}_{2}<ST\\ {X}_{i,j}^{t}+Q\cdot L, & {R}_{2}\ge ST\end{cases}$$
(21)

where t denotes the number of current iterations, j = 1, 2, …, d, and \({X}_{i,j}^{t}\) denotes the position of the ith sparrow in the jth dimension. itermax denotes the maximum number of iterations, α is a random number in the range (0, 1), and \({R}_{2}\left({R}_{2}\in \left[0,1\right]\right)\) and \(ST\left(ST\in \left[0.5,1\right]\right)\) represent the warning value and the safety value, respectively. Q is a random number drawn from the standard normal distribution. L denotes a 1 × d matrix in which each element is 1. When R2 < ST, there are no natural predators around and the producer can conduct a wide search for food. If R2 ≥ ST, some sparrows have spotted natural enemies and the whole population needs to fly quickly to other safe areas.

Except for the producers, the remaining are all scroungers, and the position update formula of the scroungers is as follows.

$${X}_{i,j}^{t+1}=\begin{cases}Q\cdot \mathrm{exp}\left(\frac{{X}_{\mathrm{worst}}^{t}-{X}_{i,j}^{t}}{{i}^{2}}\right), & i>n/2\\ {X}_{p}^{t+1}+\left|{X}_{i,j}^{t}-{X}_{p}^{t+1}\right|\cdot {A}^{+}\cdot L, & \mathrm{otherwise}\end{cases}$$
(22)

where Xworst denotes the global worst position, Xp denotes the best position currently occupied by the producers, A denotes a 1 × d matrix in which each element is randomly assigned 1 or −1, and A+ = AT(AAT)−1. When i > n/2, the ith scrounger with a poor fitness value has not obtained food and has a very low energy value, so it needs to fly elsewhere to forage for more energy.

When the population is foraging, a small number of sparrows are selected as watchers; when a natural enemy appears, both producers and scroungers abandon the current food and fly to a new location. In each generation, SD (generally 10%–20%) of the sparrows are randomly selected from the population for this early-warning behavior. The location updating formula is as follows:

$${X}_{i,j}^{t+1}=\left\{\begin{array}{cc}{X}_{\text{best}}^{t}+\beta \cdot \left|{X}_{i,j}^{t}-{X}_{\text{best}}^{t}\right|& {f}_{i}>{f}_{g}\\ {X}_{i,j}^{t}+K\cdot \left(\frac{\left|{X}_{i,j}^{t}-{X}_{\text{worst}}^{t}\right|}{\left({f}_{i}-{f}_{w}\right)+\varepsilon }\right)& {f}_{i}={f}_{g}\end{array}\right.$$
(23)

where Xbest denotes the global best position, β is the step control parameter, a normally distributed random number with mean 0 and variance 1, and K is a uniform random number in the range [−1, 1] that also indicates the sparrow's moving direction. Here, fi is the current fitness value of the sparrow, and fg and fw are the current global best and worst fitness values, respectively. ε is a minimal constant that avoids a zero denominator. When fi > fg, the sparrow is at the edge of the population and is highly vulnerable to natural predators; fi = fg indicates that a sparrow in the middle of the population is aware of the danger of attack and needs to move closer to the other sparrows.
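A minimal sketch of one possible SSA implementation of Eqs. (18)-(23) follows; the search bounds, the simplified random warning value, and the sphere-function demo are illustrative assumptions rather than the exact configuration used in this paper.

```python
import numpy as np

def sparrow_search(f, dim, n=10, iters=100, lb=-1.0, ub=1.0,
                   pd=0.2, sd=0.1, st=0.8, rng=np.random.default_rng(0)):
    """Minimize f over [lb, ub]^dim with producers, scroungers, and watchers."""
    X = rng.uniform(lb, ub, (n, dim))                    # Eq. (18): sparrow positions
    fit = np.apply_along_axis(f, 1, X)                   # Eqs. (19)-(20): fitness values
    n_prod = max(1, int(pd * n))                         # producers lead the population
    for _ in range(iters):
        order = np.argsort(fit)                          # best fitness first
        X, fit = X[order], fit[order]
        best, worst = X[0].copy(), X[-1].copy()
        for i in range(n_prod):                          # producers (Eq. 21)
            if rng.random() < st:                        # R2 < ST: wide search
                X[i] *= np.exp(-i / (rng.random() * iters + 1e-12))
            else:                                        # R2 >= ST: move to a safe area
                X[i] += rng.standard_normal()
        for i in range(n_prod, n):                       # scroungers (Eq. 22)
            if i > n / 2:
                X[i] = rng.standard_normal() * np.exp((worst - X[i]) / i**2)
            else:
                A = rng.choice([-1.0, 1.0], dim)
                A_plus = A / (A @ A)                     # A^+ = A^T (A A^T)^{-1} for a 1 x d A
                X[i] = X[0] + (np.abs(X[i] - X[0]) @ A_plus) * np.ones(dim)
        watchers = rng.choice(n, max(1, int(sd * n)), replace=False)
        for i in watchers:                               # early warning (Eq. 23)
            if fit[i] > fit[0]:                          # edge of the population
                X[i] = best + rng.standard_normal() * np.abs(X[i] - best)
            else:                                        # middle: move toward the others
                K = rng.uniform(-1.0, 1.0)
                X[i] = X[i] + K * np.abs(X[i] - worst) / (fit[i] - fit[-1] + 1e-50)
        X = np.clip(X, lb, ub)
        fit = np.apply_along_axis(f, 1, X)
    i_best = int(np.argmin(fit))
    return X[i_best], fit[i_best]

# illustrative usage: minimize the sphere function
best_x, best_f = sparrow_search(lambda x: float(np.sum(x**2)), dim=5)
```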

Proposed model descriptions

The architecture of the SWT-SSA-DELM model is shown in Fig. 5 and is implemented through the following three steps.

  1. (1)

    Denoise the original time series based on SWT; the core techniques can be outlined as the continuous wavelet transform, instantaneous frequency acquisition, and compression and reassignment.

  2. (2)

    Introduce SSA into the DELM model to form the SSA-DELM hybrid model.

  3. (3)

    Build the interface between the noise-reduced data and the hybrid prediction model, and analyze and verify the effectiveness of the SWT-SSA-DELM model through case studies.

Fig. 5
figure 5

Architecture of the SWT-SSA-DELM

Benchmark models

In this paper, the standalone ELM and DELM models, as well as their improved forms based on ensemble learning, together with LSTM, the back propagation neural network (BPNN), support vector regression (SVR), and ARIMA, were adopted as comparison models.

LSTM

The recurrent neural network (RNN) introduces the concept of time series into the network structure design, which makes it more suitable for analyzing time series data. However, traditional RNNs suffer from vanishing gradients as the time interval increases. To solve this long-term dependency problem, Hochreiter and Schmidhuber (1997) proposed LSTM based on the RNN. The LSTM network modifies the memory unit by transforming the tanh layer in the traditional RNN into a structure that includes a memory cell and a gate mechanism (Song et al. 2021a, b). This mechanism determines how the information in the memory cell is used and updated, thus alleviating the problems of vanishing and exploding gradients.

BPNN

The BP neural network model consists of an input layer, an output layer, and several hidden layers, with the output of each layer serving as the input of the next. The network training process is divided into forward propagation and backward propagation (Jia and Zhou 2020). A given input is propagated from the input layer through the hidden layers to the output layer by forward propagation, while backward propagation adjusts the weights and biases of the model based on error feedback. Training ends when the network output error falls below a set criterion or the maximum training count is reached.

SVR

The support vector machine (SVM) is a machine learning algorithm based on the VC dimension theory of statistical learning and the principle of structural risk minimization. Its core idea is to find the separating hyperplane that maximizes the margin between heterogeneous support vectors in the data space, which serves as the optimal decision boundary. When SVM is applied to regression problems, it is known as SVR, which uses the hyperplane decision boundary of support vector classification as the regression model.

ARIMA

The ARIMA model is one of the methods for time-domain analysis of time series, a family that mainly includes the AR model, the MA model, the ARIMA model, the autoregressive conditional heteroskedasticity (ARCH) model, and their derivatives. ARIMA was developed on the basis of the AR and MA models: it can render a non-stationary time series stationary by differencing, overcoming the shortcoming that the AR and MA models can only handle stationary series. At the same time, compared with the ARCH model and its derivatives, it has a simple structure and few input variables, which makes it one of the most widely used methods for univariate time series forecasting.

Parameter specification

The selected monitoring data can contain abnormal and missing values due to instrument failure, signal interference, etc. The original data were first screened for outliers using the Pauta (3σ) criterion (Zhou 2020); the outliers were treated as missing values, and the missing values were then filled using the Lagrange interpolation method.
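As a sketch of this cleaning rule (the neighborhood size k = 4 for the local Lagrange polynomial is an illustrative assumption, not a setting reported in this paper):

```python
import numpy as np

def clean_series(x, k=4):
    """Flag 3-sigma outliers as missing, then fill gaps by Lagrange interpolation."""
    x = np.array(x, dtype=float)
    mu, sigma = np.nanmean(x), np.nanstd(x)
    x[np.abs(x - mu) > 3 * sigma] = np.nan             # Pauta (3-sigma) criterion
    bad = np.where(np.isnan(x))[0]
    good = np.where(~np.isnan(x))[0]
    for i in bad:
        nbr = good[np.argsort(np.abs(good - i))[:k]]   # k nearest valid samples
        # Lagrange polynomial through (m, x[m]) for m in nbr, evaluated at i
        x[i] = sum(
            x[j] * np.prod([(i - m) / (j - m) for m in nbr if m != j])
            for j in nbr
        )
    return x
```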

In this paper, the original data is divided into training and test sets in the ratio of 7:3. The sliding window length is 10; that is, the data of the first 9 weeks is used to predict the next week. The data are normalized so that the model trains better and faster:

$${x}_{i}^{^{\prime}}=\frac{{x}_{i}-{x}_{imin}}{{x}_{imax}-{x}_{imin}}$$
(24)

where the ith variable is represented by xi; ximin and ximax represent the minimum and maximum values of xi in the training set, respectively.

After the training process of the model is completed, the normalized model output is denormalized to the actual values:

$${x}_{i}=\left[{x}_{i}^{^{\prime}}\left({x}_{imax}-{x}_{imin}\right)\right]+{x}_{imin}$$
(25)
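A small sketch of the split, windowing, and scaling described above (Eqs. 24 and 25); the function names are illustrative, and the min and max are taken from the training set only, as in the text.

```python
import numpy as np

def make_dataset(series, window=10, train_ratio=0.7):
    """Sliding windows of length 10: the first 9 values are inputs, the 10th is the target."""
    Xy = np.lib.stride_tricks.sliding_window_view(series, window)
    X, y = Xy[:, :-1], Xy[:, -1:]
    n_train = int(train_ratio * len(X))
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]

def minmax_fit(train):
    return train.min(axis=0), train.max(axis=0)

def normalize(x, lo, hi):          # Eq. (24)
    return (x - lo) / (hi - lo)

def denormalize(x_norm, lo, hi):   # Eq. (25)
    return x_norm * (hi - lo) + lo
```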

After an extensive literature search (Farzin and Valikhan Anaraki 2021) and trial-and-error calculations, the parameters of the applied models were determined. The DELM model parameters are set as follows: the activation function is sigmoid, 30 hidden layers are used, and the number of nodes in each hidden layer is 20. The SSA parameters for optimizing DELM are set as follows: the number of sparrows is 10, the maximum number of iterations is 100, and the initial proportion of producers is 20%. The prediction results of SWT-SSA-DELM are compared with those of several basic algorithms and their improved versions. The base ELM model uses 10 hidden layer nodes; the kernel function of both DELM and SVR is the radial basis kernel function; SVR hyperparameters are generated by fivefold cross-validation, with the kernel parameter g and penalty parameter C both taking values in the interval [-10, 10]. The BPNN has a single hidden layer with 10 nodes, a learning rate of 0.1, a maximum of 1000 iterations, and a target accuracy of 1e-3.

The initial parameters of the standalone LSTM algorithm are L1 = 20, L2 = 20, Lr = 0.005, and K = 20, where L1 and L2 refer to the numbers of LSTM units in the first and second layers, Lr refers to the learning rate, and K refers to the number of training iterations. For the three indicators in the ARIMA (p, d, q) structure, based on the Akaike information criterion (AIC) and Bayesian information criterion (BIC), the p and q that minimize the sum of AIC and BIC are used as the parameters of the ARIMA model. In this paper, p and q range over 1–5 and d over 0–2, where p is the autoregressive order, q is the moving average order, and d is the number of differences. For the hybrid models built on ELM and DELM, the parameter settings are kept consistent with those of the subcomponents. In addition, the root mean square error (RMSE) is selected as the fitness function of the applied models.
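For the ARIMA order search, a sketch along these lines (assuming the statsmodels package) selects p, d, and q by minimizing AIC + BIC:

```python
import itertools
import warnings

from statsmodels.tsa.arima.model import ARIMA

def select_arima_order(series, p_range=range(1, 6), d_range=range(0, 3),
                       q_range=range(1, 6)):
    """Pick the (p, d, q) order minimizing AIC + BIC, as described above."""
    best_order, best_score = None, float("inf")
    for p, d, q in itertools.product(p_range, d_range, q_range):
        try:
            with warnings.catch_warnings():
                warnings.simplefilter("ignore")          # convergence warnings are common
                fit = ARIMA(series, order=(p, d, q)).fit()
            score = fit.aic + fit.bic
        except Exception:
            continue                                     # skip orders that fail to fit
        if score < best_score:
            best_order, best_score = (p, d, q), score
    return best_order
```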

Performance metrics

The selection of reasonable performance criteria is of utmost importance for assessing model performance; however, there is no universal standard index for evaluating the effectiveness of prediction models. In this paper, the Nash–Sutcliffe efficiency coefficient (NSE), mean absolute error (MAE), RMSE, mean absolute percentage error (MAPE), index of agreement (IA), correlation coefficient (CC), determination coefficient (R2), and the width of the uncertainty band for the 95% confidence level (± 1.96Se) (Alizadeh et al. 2018) are adopted to statistically evaluate the applied models. The mathematical representations are as follows:

$$MAE=\frac{1}{n}\sum_{i=1}^{n}\left|{\widehat{DO}}_{i}-{DO}_{i}\right|$$
(26)
$${\text{MAPE}}=\frac{1}{n}\sum_{i=1}^{n}\left|\frac{{\widehat{DO}}_{i}-{DO}_{i}}{{DO}_{i}}\right|\times 100\%$$
(27)
$${\text{RMSE}}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({\widehat{DO}}_{i}-{DO}_{i}\right)}^{2}}$$
(28)
$${R}^{2}=\frac{\sum\limits_{i=1}^{n}{\left({\widehat{DO}}_{i}-\overline{DO }\right)}^{2}}{\sum\limits_{i=1}^{n}{\left({DO}_{i}-\overline{DO }\right)}^{2}}$$
(29)
$$CC=\frac{\sum\limits_{i=1}^{n}\left({\widehat{DO}}_{i}-\overline{\widehat{DO}}\right)\left({DO}_{i}-\overline{DO }\right)}{\sqrt{\sum\limits_{i=1}^{n}{\left({\widehat{DO}}_{i}-\overline{\widehat{DO}}\right)}^{2}}\sqrt{\sum\limits_{i=1}^{n}{\left({DO}_{i}-\overline{DO }\right)}^{2}}}$$
(30)
$$NSE=1-\frac{\sum\limits_{i=1}^{n}{\left({DO}_{i}-{\widehat{DO}}_{i}\right)}^{2}}{\sum\limits_{i=1}^{n}{\left({DO}_{i}-\overline{DO }\right)}^{2}}$$
(31)
$$IA=1-\frac{\sum\limits_{i=1}^{n}{\left({\widehat{DO}}_{i}-{DO}_{i}\right)}^{2}}{\sum\limits_{i=1}^{n}{\left(\left|{\widehat{DO}}_{i}-\overline{DO }\right|+\left|{DO}_{i}-\overline{DO }\right|\right)}^{2}}$$
(32)
$$\left\{\begin{array}{c}\overline{e }=\frac{1}{n}\sum\limits_{i=1}^{n}\left({\widehat{DO}}_{i}-{DO}_{i}\right)\\ {S}_{e}=\sqrt{\sum\limits_{i=1}^{n}{\left({\widehat{DO}}_{i}-{DO}_{i}-\overline{e }\right)}^{2}/\left(n-1\right)}\end{array}\right.$$
(33)

where n is the size of the prediction data set, \({DO}_{i}\) and \({\widehat{DO}}_{i}\) represent the measured and predicted DO concentrations at time i, and \(\overline{DO }\) and \(\overline{\widehat{DO}}\) denote the mean values of the measured and predicted DO concentrations, respectively.
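For reference, direct NumPy versions of several of these criteria (Eqs. 26-28, 31, and 32, with the IA denominator in its standard squared form) might look as follows:

```python
import numpy as np

def metrics(do, do_hat):
    """MAE, RMSE, MAPE, NSE, and IA between measured and predicted DO series."""
    do, do_hat = np.asarray(do, float), np.asarray(do_hat, float)
    err = do_hat - do
    mae = np.mean(np.abs(err))                                     # Eq. (26)
    mape = np.mean(np.abs(err / do)) * 100.0                       # Eq. (27)
    rmse = np.sqrt(np.mean(err ** 2))                              # Eq. (28)
    nse = 1 - np.sum(err ** 2) / np.sum((do - do.mean()) ** 2)     # Eq. (31)
    ia = 1 - np.sum(err ** 2) / np.sum(                            # Eq. (32)
        (np.abs(do_hat - do.mean()) + np.abs(do - do.mean())) ** 2
    )
    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "NSE": nse, "IA": ia}
```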

In general, the smaller the MAE, MAPE, and RMSE, the smaller the error relative to the actual values. MAPE-based model performance can be classified into four classes, ‘excellent,’ ‘good,’ ‘reasonable,’ and ‘inaccurate,’ corresponding to MAPE values of < 10%, 10%–20%, 20%–50%, and > 50%, respectively (Lu and Ma 2020). The range of NSE is (-∞, 1]; NSE = 1 indicates perfect correspondence between the predicted and measured data, and the closer the NSE is to 1, the better the prediction results. The IA ranges from 0 to 1; larger IA values and smaller Se values represent better prediction performance of the model.

Results

Although the data sets of both the Xinchengqiao and Xiaolangdi monitoring stations are analyzed in this paper, to avoid repetition, the main body focuses on Xinchengqiao; the detailed prediction results for Xiaolangdi are given in Table 3.

Sensitivity analysis

The parameter values of the models applied in this paper were given in the subsection ‘Parameter specification’; this subsection focuses on the parameter settings of the proposed SWT-SSA-DELM model. Its parameters comprise two parts, SSA and DELM: the SSA parameters follow the literature (Song et al. 2021a, b), with the number of iterations increased to 100, while the important DELM parameters, namely the number of hidden layers and the number of nodes in each hidden layer, are determined by trial and error. To find their best values, the number of hidden layers was varied from 10 to 60 at intervals of 10, and the number of nodes in each hidden layer from 5 to 30 at intervals of 5. The variation surface of RMSE was obtained by interpolation, as shown in Fig. 6. For the sample data at Xinchengqiao, the best RMSE (0.171) corresponds to 30 hidden layers and 20 nodes per hidden layer, and the RMSE increases with distance from these values.

Fig. 6
figure 6

The RMSE variation based on the number of hidden layers and nodes in each hidden layer

Evaluation of noise reduction effect

The time–frequency plots of the raw time series before and after noise reduction are given in Fig. 7. Local amplification shows that the denoised data is greatly reduced in the high-frequency range, indicating that the high-frequency noise components have been removed.

Fig. 7
figure 7

Time–frequency plots of the raw data (a) unprocessed and (b) denoised

Figure 8(a) compares the fast Fourier transform (FFT) spectra with and without SWT noise reduction. In terms of overall morphology, the spectral curves with and without SWT processing do not differ much. However, amplified observation of the high-frequency part of the curve shows that, after noise reduction, the distribution of the data on the high-frequency components is significantly weakened and the noise above 13 Hz basically disappears, so the noise reduction performance is good. Figure 8(b) presents the water quality time series with and without SWT treatment. Echoing Fig. 8(a), the denoised water quality data is smoother and more rounded, and the outliers in the data are removed to a certain extent.

Fig. 8
figure 8

Noise reduction results based on SWT. (a) FFT spectrum (b) Water quality time series

To test the noise reduction effect of SWT on water quality signals, the results are compared with those obtained by combining EMD, EEMD, CEEMD, CEEMDAN, and VMD with FE, respectively. The basic procedure of the comparison denoising algorithms is to remove the first two decomposed components with higher FE and then reconstruct the remaining components. This paper adopts five commonly used noise reduction evaluation criteria, namely RMSE, the signal-to-noise ratio Rsn (Song et al. 2021a, b), CC, the residual energy percentage Esn, and sequence smoothness (Davies 1994). A good noise reduction model usually corresponds to a smaller RMSE and larger Rsn, CC, Esn, and smoothness.

Table 2 shows the noise reduction effect of different methods. According to Table 2, the performance of SWT is the best among similar noise reduction models in four evaluation metrics: RMSE, Rsn, CC, and smoothness, and the performance of SWT is second only to CEEMD-FE in Esn.

Table 2 Noise reduction performance of applied algorithms

Figure 9 shows the reconstructed sequence errors of the various noise reduction methods, including EMD-FE, EEMD-FE, CEEMD-FE, CEEMDAN-FE, and VMD-FE. The amplitude of the sequence error after SWT noise reduction fluctuates within [-0.73, 0.60], and its noise reduction performance clearly outperforms the remaining five models. These results are basically consistent with Table 2.

Fig. 9
figure 9

Reconstructed sequence errors of different noise reduction methods

Evaluation of predictive performance

The evolution curves of the fitness values of the applied algorithms are depicted in Fig. 10, where the X-axis represents the number of iterations and the Y-axis represents the fitness value. As the algorithms iterate, the error gradually converges, and the models with SWT preprocessing converge more rapidly. The error convergence curve of the SWT-SSA-DELM model reaches the target accuracy within 10 iterations and remains at the optimum thereafter, which indicates the superior learning and training ability of the proposed model. It is worth noting that the initial fitness values of the models pre-treated by SWT are lower, which indirectly verifies the effectiveness of SWT.

Fig. 10
figure 10

Error convergent curve with iteration

Figure 11(a) shows the frequency distribution histograms of the prediction errors of the applied algorithms. The errors of the hybrid SWT-SSA-DELM proposed in this paper are distributed around zero, i.e., the probability density is highest in the intervals with the smallest errors, reflecting high prediction accuracy. The ELM, DELM, SSA-ELM, and SSA-DELM models have wider error distributions and higher probability density at larger errors, and the proposed model also achieves better prediction accuracy than the SWT-ELM, SWT-DELM, and SWT-SSA-ELM models.

Fig. 11
figure 11

Error statistics of applied algorithms: (a) Frequency distribution histogram and (b) MAPE plot

Figure 11(b) shows the MAPE statistics of the applied algorithms. The mean MAPE values of the eight methods are all less than 10%, which indicates that these models can meet the accuracy needs of general water quality prediction. Among similar models, SWT-SSA-DELM has the smallest MAPE value, and even its outliers do not exceed the 10% threshold; compared with the standalone ELM, the MAPE of SWT-SSA-DELM decreased by 78.54%, indicating excellent predictive performance.

To further test and evaluate the feasibility and superiority of the SWT-SSA-DELM, LSTM, ARIMA, SVR, and BPNN are used as the benchmark algorithms in this paper.

The Taylor plot corresponding to the predicted results of each model is presented in Fig. 12. The root mean square deviation (RMSD) values are sorted as follows: SWT-SSA-DELM < SWT-DELM < SSA-DELM < DELM < BPNN < SVR < LSTM < ARIMA; the CC values are sorted as follows: SWT-SSA-DELM > SWT-DELM > SSA-DELM > DELM > BPNN > SVR > LSTM > ARIMA. Considering the multiple error evaluation indicators, the SWT-SSA-DELM model has the most superior predictive capability and can meet the requirements of water quality prediction in the Yellow River Basin.

Fig. 12
figure 12

Taylor plot comparing the performance of applied models

The correlation between the measured and predicted DO concentrations of the applied algorithms is presented in Fig. 13, where the black line and red line represent the y = x reference line and the regression line, respectively. The SWT-SSA-DELM model shows the best fit between predicted and measured values, with CC and R2 reaching 0.983 and 0.966, respectively, significantly higher than the other models. Compared with DELM, which is merely stacked from layered autoencoders, the error statistics of SSA-DELM verify the importance of SSA-based parameter optimization, and further comparison with SWT-DELM confirms the necessity of SWT preprocessing.

Fig. 13
figure 13

Correlation between the measured and predicted DO concentrations

The violin plot displays the distribution and probability density of multiple sets of data, combining the characteristics of the box plot and the density plot, and is mainly used to show the distribution shape of the data. Figure 14(a) shows the violin plot of the applied algorithms. The violin pattern of the SWT-SSA-DELM prediction results is the most similar to that of the original measured data, and its median and frequency distribution are closest to the true values, indicating the best prediction performance. In addition, according to the predicted DO concentrations, the water quality at the two monitoring stations is good under the Chinese water quality classification standard.

Fig. 14
figure 14

Prediction results of applied algorithms: (a) Violin plot and (b) Prediction curve

The prediction curves of the various methods are presented in Fig. 14(b). The predicted curves of the applied algorithms are generally consistent in trend but differ in details. The predictions of SWT-SSA-DELM are the closest to the measured values and are largely distributed within the 95% confidence interval of the actual values, indicating relatively accurate prediction.

Table 3 summarizes the prediction results of the different models. The performance of SWT-SSA-DELM is dramatically improved: compared with the basic ELM, it reduced MAE, MAPE, RMSE, and the width of the uncertainty band by 79.05%, 78.54%, 80.23%, and 77.64%, and increased R2, CC, NSE, and IA by 210.33%, 76.16%, 792.56%, and 49.02%, respectively. Its improvement over the standalone ELM model was greater than that of the other models, and its error performance was also better than that of some newly improved ELM algorithms (Yu et al. 2018; Liu et al. 2020). Moreover, the predicted water quality at the Xiaolangdi hydrological station shows the same pattern as at Xinchengqiao, indicating the stability of the proposed algorithm.

Table 3 Summary of error statistics of applied models

Discussion

Summarizing the above results, the proposed SWT-SSA-DELM model outperforms the other models in prediction performance. It is worth mentioning that SWT-ELM and SSA-ELM correspond to ELM processed by data cleaning and by a metaheuristic optimization algorithm, respectively, and the former predicts better than the latter. Based on the Xinchengqiao monitoring data, SWT-ELM is 49.38%, 51.22%, 42.01%, and 42.37% lower than SSA-ELM in MAE, MAPE, RMSE, and 1.96Se, respectively, and 32.59%, 15.12%, 34.70%, and 8.75% higher in R2, CC, NSE, and IA, respectively. Furthermore, the forecasting results of both models at Xinchengqiao are consistent with those at Xiaolangdi station, which indirectly indicates the importance of data cleaning. Currently, much research effort has been invested in integrating metaheuristic optimization algorithms with basis models, while data cleaning, i.e., data preparation, deserves more attention.

SWT has mainly been applied to the noise reduction of chaotic signals in fields such as microseismic signals, wind power, and lightning electric field signals. The chaotic signals in these fields usually have huge data volumes, which allows the noise reduction performance of SWT to be exploited fully. In this paper, weekly water quality monitoring data of the Yellow River Basin were chosen; although the monitoring span reaches 7 years, the data volume of each monitoring indicator is only 365, much smaller than the data magnitude in the fields mentioned above. Therefore, future research will consider data sets with a higher monitoring frequency or a longer monitoring period to ensure the quality of the training set. Furthermore, limited by the completeness of the data, only one monitoring indicator, DO, was selected for prediction in this paper. To better verify the robustness and adaptability of the proposed model, multiple monitoring indicators from multiple regions should be included in future studies. In addition, two-phase denoising techniques, i.e., singular spectrum decomposition, CEEMDAN, VMD, and other decomposition techniques combined with SWT, are worth exploring for their noise reduction effectiveness.

In this paper, the sliding window algorithm is employed to implement single-step forward scrolling prediction. Both the sliding window size and the specific forward prediction step have an impact on the model prediction performance; therefore, the determination of these two parameters and the detailed impact on the model error need to be further explored.

The hyperparameters of DELM are optimally selected using SSA in this paper, and although SSA is of great use in this process, it is not without drawbacks: it is prone to falling into local extremes in later iterations as population diversity decreases. So far, few attempts have been made to optimize and improve SSA, and the performance of an improved SSA deserves to be evaluated and explored in future studies.

Conclusion

A highly accurate water quality prediction model plays a pivotal role in water resources allocation, scheduling, and other decision making. In this paper, an improved hybrid DELM model (SWT-SSA-DELM) for water quality prediction was proposed and tested with hydrological monitoring data from 2010 to 2016 measured at Xinchengqiao and Xiaolangdi, Yellow River Basin, China. The statistical parameters MAE, MAPE, RMSE, R2, CC, NSE, IA, and the width of the uncertainty band for the 95% confidence level were used to assess model performance. The standalone ELM and DELM, as well as their improved forms based on ensemble learning, LSTM, ARIMA, SVR, and BPNN were used as comparative prediction algorithms.

The statistical results indicated that the water quality prediction model based on SWT-SSA-DELM outperforms its peers. DELM can extract the in-depth information in the water quality time series, overcoming the limitation that the ordinary ELM model, with its single hidden layer structure, struggles to capture highly correlated characteristics, and thereby dramatically improving the accuracy of the prediction model.

Although the performance of the model put forward in this paper is acceptable, there is still room for improvement: (1) choose a more complete data set, and study the noise reduction performance of two-phase denoising techniques; (2) study the effects of the sliding window size and the forward prediction step on prediction accuracy; (3) develop and improve the existing SSA algorithm; one idea is to hybridize it with the sine cosine algorithm and Lévy flight to avoid SSA falling into local optima.