Introduction

Sustainable water management entails using water in a way that fulfills present human needs while also ensuring its availability for future generations (Azizi et al. 2022; Nova 2023). It requires careful consideration of three aspects: economic efficiency, social quality, and environmental preservation (Jahanshahi and Kerachian 2019; Jahanshahi et al. 2023). The concept of sustainable development is intricately tied to water quality management, an integration explicitly embraced within the sustainable development framework (Chapman and Sullivan 2022). In addition to playing an important role in environmental processes, water quality has a profound effect on human and animal health. Water contamination can cause serious health problems, including gastrointestinal diseases, skin diseases, and cancer (Lin et al. 2022; Babuji et al. 2023). Therefore, monitoring water quality is a fundamental component of effectively managing water resources (Sinshaw et al. 2019; Uddin et al. 2021). Evaluating water quality involves the monitoring of diverse variables, with the resulting data finding application across multiple domains. For instance, the analysis of surface and groundwater quality through spatiotemporal data (Yan et al. 2022; Zamani et al. 2023b; Uddin et al. 2023a, 2023b, 2023c) aids in the identification of emerging contaminants, thereby facilitating the formulation of effective strategies for safeguarding and conserving the environment (Nadiri et al. 2022). Additionally, WQVs can be integral in optimizing reservoir operations and water allocation (Nikoo et al. 2013), as well as in establishing dedicated networks for water quality monitoring (Jiang et al. 2020).

Globally, one of the noteworthy concerns regarding water quality is the eutrophication of reservoirs and rivers, which poses significant economic, social, and environmental risks (Vinçon-Leite and Casenave 2019). This phenomenon entails water bodies becoming enriched with excessive plant nutrients, primarily phosphorus and nitrogen. The proliferation of these nutrients leads to an accelerated growth of algae, significantly deteriorating the quality of water across diverse sources (Boyd 2020; Uddin et al. 2022b). Chl-a is one of the most commonly employed indicators for assessing eutrophication, offering an approximation of the algae concentration in a water body (Zhang et al. 2019). Thus, monitoring the levels of Chl-a using AI tools, as an innovative and efficient approach, can help manage water bodies more sustainably (Shin et al. 2020; Barzegar et al. 2020). Numerous research studies are available in this area. For example, Li et al. (2017) utilized hybrid ANNs to forecast Chl-a levels, as a representative index of eutrophication, in several lakes in China. Guo et al. (2018) assessed the nutrient status of a water supply reservoir by analyzing Chl-a data during different seasons. Papenfus et al. (2020) investigated the feasibility of utilizing spectral bands within remote sensing data to monitor Chl-a levels in lakes and reservoirs located in the USA by analyzing both in situ measurements and satellite imagery on a nationwide scale.

ML and its subset, DL, have transformed the field of AI with their applications, and such approaches have increasingly gained popularity in water resources fields such as hydrology (Khosravi et al. 2021; Meydani et al. 2022). Most traditional knowledge-driven methods are laborious and susceptible to prediction errors since they presume linear relationships among variables (Ma et al. 2014). Unlike traditional models, ML and DL models excel in managing complex and non-linear systems due to their ability to operate without the need for an initial hypothesis regarding the data structure or underlying physical processes (Shen et al. 2018; Barzegar et al. 2019; Sahraei et al. 2021; Uddin et al. 2022a). Ewuzie et al. (2022) highlighted the increasing interest in the use of ML and, particularly, DL in the domain of modeling and predicting/forecasting water quality. Numerous studies have explored the capability of different AI models, encompassing ML and DL models, such as NB (Bhardwaj et al. 2022; Uddin et al. 2023a, 2023b, 2023c), RF (Sakaa et al. 2022; Virro et al. 2022; Zamani et al. 2023a), ANNs (Chen and Liu 2014; Dawood et al. 2021; Wang et al. 2022; Georgescu et al. 2023; Farshbaf Aghajani et al. 2023), RNN (Cho et al. 2014; Qi et al. 2020; Prasad et al. 2022; Ni et al. 2023), BPNN (Barzegar and Asghari Moghaddam 2016; Chen et al. 2023b), ANFIS (Gaya et al. 2020; Rizal et al. 2023), SVM (Haghiabi et al. 2018; Chou et al. 2018; Elkiran et al. 2019; El Bilali and Taleb 2020; Ortiz-Lopez et al. 2022; Uddin et al. 2023a, 2023b, 2023c), ELM (Zhu and Heddam 2020), MLR (Barzegar et al. 2017; Liu et al. 2019), CNN (Pyo et al. 2020; Chen et al. 2020b), LSTM (Li et al. 2022; Liang et al. 2020; Kouadri et al. 2021), GRU (Luo et al. 2019; Fu et al. 2021), and TCN (Chen et al. 2020a) for predicting/forecasting various WQVs, such as DO, salinity, fluoride, Chl-a (Choi et al. 2019), TDS, and EC (Li et al. 2019).

While individual ML/DL models can achieve satisfactory performance in predicting/forecasting WQVs, hybrid models combine diverse individual models to address limitations such as a restricted capacity to capture intricate patterns, limited generalization, challenges in handling noise and outliers, and constrained interpretability. These hybrid models harness the strengths of each singular model, including their simplicity, faster training, and robust inference capabilities (Barzegar et al. 2018; Chen and Dai 2020; Bahrami et al. 2023). The main objective of an EM is to integrate standalone models in a way that enhances the precision of their outcomes, maximizing the benefits of the standalone models while minimizing their limitations (Rozinajová et al. 2018). In the past few years, hybrid procedures have found extensive application in water quality evaluation, modeling, and prediction/forecasting. Barzegar et al. (2018) investigated ELM and hybrid models designed to forecast EC values over multiple time steps and compared the results with the outcomes of an ANFIS and a wavelet-ANFIS model. Bui et al. (2020) explored four individual and twelve hybrid ML prediction models for WQVs in Iran. Barzegar et al. (2020) employed the CEEMDAN and VMD algorithms alongside ELM and LSSVM models to predict DO and Chl-a levels in the SPL in Greece. Cao et al. (2020) combined the GRU neural network and the K-means clustering method for the prediction of DO. Dehghani et al. (2021) developed four hybrid models by incorporating four metaheuristic optimization algorithms into SVR and observed improvements in their results. Alqahtani et al. (2022) presented individual ML algorithms alongside their amalgamation, incorporating RF, GEP, and ANN, for the prediction of EC and TDS. In a different light, Sakaa et al. (2022) formulated a hybridized model utilizing sequential minimal optimization and SVM to predict WQVs. Wu and Wang (2022) developed an ensemble model integrating ANN and LSTM and their combination with the wavelet transform for WQV prediction in a river in China. Moreover, Zamani et al. (2023a) proposed an innovative fusion model leveraging the Bayesian maximum entropy fusion approach to precisely estimate DO and Chl-a concentrations across various locations and depths within a reservoir in Oman.

Considering the existing literature, it is evident that various data-driven algorithms have been employed to accurately model WQVs. However, to the best of the authors' knowledge, the current literature still lacks a comprehensive comparative analysis of widely established DL forecasting techniques and their ensembling utilizing single- and multi-objective optimization structures — specifically, GA and NSGA-II — for WQV forecasting. Moreover, the integration of DL models holds the potential to enhance the precision of forecasting WQVs, especially in situations characterized by incomplete or insufficient observed data, such as short-period time series. Given these considerations, this study seeks to contribute to the realm of WQV (e.g., Chl-a) forecasting. The foundational framework for this modeling endeavor revolves around the training of four distinct DL models — namely, RNN, LSTM, GRU, and TCN — using Chl-a concentration data. Subsequently, these trained DL models are integrated through the application of single- and multi-objective optimization algorithms, resulting in the creation of EMs that are poised to forecast Chl-a concentration. Notably, this study marks the first endeavor to utilize an NSGA-II-based ensemble DL model for Chl-a forecasting. This study primarily aims to (1) comprehensively examine and compare the capabilities of diverse individual DL algorithms (i.e., RNN, LSTM, GRU, and TCN) in effectively forecasting Chl-a concentration 1 h ahead in SPL, Greece; (2) combine these individual DL models into a harmonized ensemble model, leveraging the strengths of each constituent model through the application of meta-heuristic optimization algorithms, including GA and NSGA-II; (3) execute a comparative analysis, contrasting the results generated by the individual DL models with those yielded by the EMs; and (4) compare the results obtained from the EMs, thereby highlighting and elucidating the proficiency of NSGA-II in seamlessly integrating the distinct individual DL models. Forecasting the concentration of Chl-a in the lake 1 h ahead holds practical significance in environmental monitoring, water management, and ecological research. This predictive capability aids in detecting harmful algal blooms, guiding eutrophication assessments, facilitating timely responses to water quality changes, ensuring safe recreational activities, and optimizing aquaculture and fisheries practices. By providing insights into Chl-a dynamics, this forecasting enhances decision-making for ecosystem health, water safety, and sustainable resource management.

Methodology

The DL models are utilized to forecast the concentration of Chl-a in SPL in Greece. After assessing the effectiveness of these models using different statistical metrics, they are integrated into a hybrid model through the utilization of the GA and NSGA-II algorithms. The capability of these ensemble approaches to combine the results of diverse DL models is then compared to determine their effectiveness. The models are generated following the standard practice for developing DL forecasting models, which is briefly reviewed in the following. This general framework is suggested to ensure that the final hybrid model works appropriately (Goodfellow et al. 2016; Chollet 2021; Géron 2022). Figure 1 depicts the flowchart outlining the proposed methodology for forecasting the Chl-a concentration. The framework comprises the subsequent stages.

Fig. 1

The visual representation of the suggested forecasting framework

Data collection and processing were initiated to assess water quality in this study. To obtain the required information, a multi-probe sensor CYCLOPS-7, developed by TURNER DESIGNS, was employed to measure various parameters in SPL. This step is shown in Fig. 1a.

The development of DL models involves several steps that ensure a robust and effective solution to a given problem. These steps are depicted in Fig. 1b. First, it is crucial to define and understand the problem, as this guides the selection of an appropriate DL model. In this study, the objective is to forecast short-term Chl-a concentrations using historical data (i.e., lag times of Chl-a). Evaluation metrics, such as RMSE and R2, are then selected to evaluate the models during the training and testing phases. Relevant data were collected and preprocessed, involving cleaning, handling missing data, data transformation, and feature extraction (i.e., input variable selection). Splitting the data is also of utmost importance for assessing the performance of a model on unseen data; it was done by separating the dataset into two primary sets: the training and testing datasets. Four models were selected for their ability to work with time-series data. Subsequently, each model underwent training using the preprocessed data, with backpropagation used to adjust the weights. Fine-tuning may be necessary to improve a model's performance by adjusting hyperparameters, modifying the architecture, or adding regularization techniques. Finally, the models were tested on the testing dataset to verify their capacity to generalize to novel data. Achieving acceptable effectiveness on the testing dataset is a critical indication of a model's capability to apply knowledge to unfamiliar data and its readiness for deployment in solving the given problem. To improve the results obtained from individual DL models, model ensembling strategies based on GA and NSGA-II were developed, as illustrated in Fig. 1c. The comparison was based on several evaluation indices and aimed to demonstrate the efficacy of the model ensembling approach.

RNN

An RNN is a class of ANN in which connections between nodes can form loops, allowing the output of a node to influence its subsequent inputs (Medsker and Jain 2001). By utilizing their internal state or memory, RNNs exhibit the capability to handle input signals of diverse lengths due to their temporal dynamic behavior, in contrast to feedforward neural networks (Carcano et al. 2008). RNNs are networks characterized by an infinite impulse response, whereas CNNs are networks with a finite impulse response; both classes of networks have a temporal dynamic component. Finite impulse recurrent networks are structured as directed acyclic graphs, allowing them to be unrolled and replaced with a strictly feedforward neural network. Conversely, infinite impulse recurrent networks are directed cyclic graphs and cannot be unrolled. Figure S1 (a & b) illustrates the architecture of the RNN developed in the current study, along with a detailed representation of the RNN structure.
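
For reference, the recurrence underlying this loop structure can be written in its simplest (Elman-type) form, where xt is the input at time t, ht the hidden state, σ an activation function, and W and b generic learnable weights and biases; this is a textbook formulation rather than the exact parameterization of the network developed here:

$$h_t=\sigma \left({W}_{xh}\,{x}_t+{W}_{hh}\,{h}_{t-1}+{b}_h\right),\kern2em {\hat{y}}_t={W}_{hy}\,{h}_t+{b}_y$$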

LSTM

The LSTM consists of a sequence of LSTM cells, with each of these cells having gates that control the flow of information within them (Fig. S2 (a & b)). The input gate selectively determines which data from the current input should be preserved and propagated through the cell state. Similarly, the forget gate decides which data should be discarded. Lastly, the output gate determines the specific data from the cell state that should be passed to the output. Alongside these gates, the LSTM cell includes a memory cell that accumulates information over time (Schmidhuber and Hochreiter 1997). This capacity allows the network to selectively retain or forget information as necessary, making it well-suited for processing lengthy sequences of data. During its forward pass, the LSTM network processes input sequentially, handling one element at a time. As each new input element is processed, the network updates its hidden state and memory cell. The final output is often generated by a fully connected layer, taking the final hidden state as input. The gates use sigmoid activation functions to regulate data transmission and a hyperbolic tangent function to process data entering the cell state (Greff et al. 2016). For a detailed mathematical implementation of LSTM, refer to Graves et al. (2012).
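
The gate operations described above admit a standard compact form, reproduced here for reference (σ denotes the sigmoid function, ⊙ element-wise multiplication, and W, U, b learnable parameters); this is the conventional LSTM formulation rather than a study-specific one:

$$\begin{aligned}{i}_t&=\sigma \left({W}_i{x}_t+{U}_i{h}_{t-1}+{b}_i\right)\\ {f}_t&=\sigma \left({W}_f{x}_t+{U}_f{h}_{t-1}+{b}_f\right)\\ {o}_t&=\sigma \left({W}_o{x}_t+{U}_o{h}_{t-1}+{b}_o\right)\\ {\tilde{c}}_t&=\tanh \left({W}_c{x}_t+{U}_c{h}_{t-1}+{b}_c\right)\\ {c}_t&={f}_t\odot {c}_{t-1}+{i}_t\odot {\tilde{c}}_t\\ {h}_t&={o}_t\odot \tanh \left({c}_t\right)\end{aligned}$$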

GRU

The GRU is a recurrent neural network architecture described by Chung et al. (2014) (Fig. S3). Similar to an LSTM unit, a GRU unit comprises gating mechanisms that regulate the flow of information. However, unlike the LSTM, the GRU unit does not have a separate cell for storing information. The GRU structure consists of a reset gate, an update gate, and a candidate activation. The reset gate controls the extent to which the previous hidden state is overlooked, while the update gate manages the incorporation of new input data into the current state. By considering the previous hidden state and the current input, the candidate activation function calculates the new candidate hidden state. For a comprehensive mathematical explanation of GRUs, refer to Haverkos et al. (2016).
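
These gate interactions can likewise be summarized in their standard form (zt the update gate, rt the reset gate, and the tilde term the candidate activation); again, this is the generic formulation rather than the exact configuration trained in this study:

$$\begin{aligned}{z}_t&=\sigma \left({W}_z{x}_t+{U}_z{h}_{t-1}+{b}_z\right)\\ {r}_t&=\sigma \left({W}_r{x}_t+{U}_r{h}_{t-1}+{b}_r\right)\\ {\tilde{h}}_t&=\tanh \left({W}_h{x}_t+{U}_h\left({r}_t\odot {h}_{t-1}\right)+{b}_h\right)\\ {h}_t&=\left(1-{z}_t\right)\odot {h}_{t-1}+{z}_t\odot {\tilde{h}}_t\end{aligned}$$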

CNN

CNNs are a type of deep learning algorithm that has had a significant impact on computer vision, finding widespread use in tasks such as object detection, semantic segmentation, and image recognition (Sultana et al. 2020). The application of such sophisticated algorithms holds promise for enhancing water quality monitoring. These networks consist of multiple layers and encompass several crucial components, including fully connected layers, convolutional layers, and pooling layers. During the forward pass, the provided information undergoes a sequence of operations. It starts by being fed into the initial layer, which contains a convolutional layer responsible for extracting features from the input data. This extraction is achieved through the application of learned filters acquired during the training process. Following the convolutional layer, the output goes through a pooling layer that reduces spatial dimensionality, thereby helping to prevent overfitting. Finally, the output is passed through one or more fully connected layers that conduct classification or regression operations based on the extracted features. For further information on this methodology, refer to Chua and Roska (1993).

TCN

TCNs were introduced by Bai et al. (2018) as a specialized neural network architecture designed for sequence modeling and prediction/forecasting tasks. TCNs employ 1D convolutional layers, which are adept at capturing temporal dependencies in data while using fewer parameters compared to traditional RNNs. This characteristic makes TCNs more efficient and easier to train. The core structure of a TCN consists of a series of 1D convolutional layers, optionally followed by a fully connected layer and an output layer. Each convolutional component employs fixed kernel sizes and strides, with the number of filters typically increasing as the sequence length decreases. The results from each convolutional component are then passed through a ReLU activation function and merged with the input to the subsequent layer using a skip connection. The skip connection facilitates the flow of information through the network without undergoing filtration by the convolutional layers, thereby addressing the vanishing gradient problem. For a detailed understanding of the implementation of a TCN model, refer to Fu et al. (2021). Figure S4 illustrates the structure of the TCN model used in this study.
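
As an illustration of this structure, the following is a minimal Keras sketch of a TCN-style stack of dilated causal 1D convolutions with skip connections; the layer sizes, dilation factors, and function names are illustrative assumptions and do not reproduce the exact configuration used in this study:

```python
from tensorflow.keras import layers, models

def residual_block(x, filters, kernel_size, dilation_rate):
    """One TCN-style block: two dilated causal convolutions plus a skip connection."""
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate, activation="relu")(y)
    if x.shape[-1] != filters:
        # 1x1 convolution so the skip connection matches the channel dimension
        x = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.add([x, y])

def build_tcn(input_len=6, filters=32, kernel_size=4, n_blocks=3):
    inputs = layers.Input(shape=(input_len, 1))   # lagged Chl-a values as a one-feature sequence
    x = inputs
    for b in range(n_blocks):
        # doubling the dilation in each block widens the receptive field exponentially
        x = residual_block(x, filters, kernel_size, dilation_rate=2 ** b)
    x = layers.GlobalAveragePooling1D()(x)
    outputs = layers.Dense(1)(x)                  # one-step-ahead Chl-a forecast
    return models.Model(inputs, outputs)

model = build_tcn()
model.compile(optimizer="adam", loss="mse")
```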

Ensemble models

EM integrates the obtained outcomes of different models. Such a procedure combines the outputs from different algorithms or models to achieve more accurate results (Hall and Llinas 1997). Given the vector of weights w = [w1, w2, …, wn]T for the results generated by the standalone models, the output of the EM (Yj) is computed in the following manner:

$${Y}_j=\sum_{i=1}^n{w}_i{y}_{ij}^{\ast }$$
(1)

Subject to:

$$\sum_{i=1}^n{w}_i=1\kern0.75em ,\kern1.25em {w}_i\in \left[0,1\right]$$
(2)

where n is the number of standalone models (i.e., RNN, LSTM, GRU, and TCN) and \({y}_{ij}^{\ast }\) is the value forecasted by the ith standalone model for the jth sample. It is pertinent to highlight that the rationale behind opting for a variant of stacking, specifically weighted averaging, instead of pursuing boosting or bagging techniques, is attributed to factors encompassing model diversity (Zhou et al. 2023), complexity and interpretability (Ribeiro and dos Santos Coelho 2020), adaptability (Tang et al. 2022), ensemble size (Zounemat-Kermani et al. 2021), and the careful navigation of trade-offs (Card et al. 2019; Tan et al. 2019). The decision to utilize GA and NSGA-II as optimization methodologies for weighted averaging is underpinned by a multifaceted set of advantages. Firstly, GA and NSGA-II are renowned for their inherent flexibility, making them highly adaptable to a wide array of complex optimization challenges (Goldberg 1989; Katoch et al. 2021). Secondly, these algorithms are adept at generating a rich diversity of solutions, an attribute that is indispensable for thoroughly investigating various combinations of base learners (Deb et al. 2002). Thirdly, their robust design enables them to effectively manage and filter out noisy or irrelevant base learners through integrated feature selection and robustness mechanisms (Sivanandam et al. 2008). Lastly, both GA and NSGA-II demonstrate exceptional proficiency in fine-tuning the weights assigned to each base learner, thereby enhancing the overall efficacy of the ensemble model (Zhang and Li 2007; Xu et al. 2022; Dai et al. 2022).
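
For illustration, the weighted-averaging rule of Eqs. 1 and 2 can be expressed in a few lines of Python; the forecasts and weights below are placeholder values, not results from this study:

```python
import numpy as np

# Rows: hypothetical forecasts of the four standalone models (RNN, LSTM, GRU, TCN)
# Columns: successive time steps j
y_star = np.array([
    [2.4, 2.9, 3.1],   # RNN
    [2.6, 3.0, 3.3],   # LSTM
    [2.5, 3.1, 3.2],   # GRU
    [2.7, 2.8, 3.4],   # TCN
])

w = np.array([0.1, 0.3, 0.4, 0.2])                                 # illustrative weights
assert np.isclose(w.sum(), 1.0) and np.all((0 <= w) & (w <= 1))    # constraint of Eq. 2

Y = w @ y_star        # Eq. 1: ensemble forecast for each time step
print(Y)              # [2.56 2.99 3.26]
```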

GA

A GA can be employed to build an EM, a machine learning technique that combines multiple models through a GA-driven optimization procedure to enhance the precision and effectiveness of the overall predictive water quality model (Barzegar et al. 2016; Chen et al. 2023a). In this approach, GA is employed to identify the most suitable combination of models from a pool of potential models, which are generally diverse and complementary in their forecasting capabilities (Babatunde et al. 2014). GA optimizes the model ensembling process by iteratively selecting the best combination of models and refining the amalgamation through crossover and mutation procedures. The GA process commences with a population of potential solutions, subject to fitness evaluation based on their accuracy and other performance metrics. Solutions with higher fitness are selected and bred to generate offspring, inheriting genetic traits from their parents. The offspring then undergo crossover and mutation operations to introduce diversity and explore new solutions (Mirjalili and Mirjalili 2019). Figure S5 illustrates the GA process used to determine optimal solutions for the Chl-a forecasting model.
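
A minimal, self-contained sketch of this idea is given below: a simple GA (truncation selection, arithmetic crossover, Gaussian mutation, and elitism) searches for ensemble weights that minimize the MARE of the weighted-average forecast. It is a schematic Python stand-in for the MATLAB implementation described later in the model development section, and the array names are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

def mare(w, preds, obs):
    """Mean absolute relative error (%) of the weighted-average ensemble."""
    ens = w @ preds
    return np.mean(np.abs((ens - obs) / obs)) * 100

def normalize(pop):
    """Keep weights non-negative and make each candidate sum to 1 (Eq. 2)."""
    pop = np.clip(pop, 1e-6, None)
    return pop / pop.sum(axis=1, keepdims=True)

def ga_weights(preds, obs, pop_size=50, n_gen=200, mut_rate=0.1):
    n_models = preds.shape[0]
    pop = normalize(rng.random((pop_size, n_models)))
    for _ in range(n_gen):
        fitness = np.array([mare(w, preds, obs) for w in pop])
        parents = pop[np.argsort(fitness)[: pop_size // 2]]        # truncation selection
        idx = rng.integers(0, len(parents), size=(pop_size, 2))    # random parent pairs
        alpha = rng.random((pop_size, 1))
        children = alpha * parents[idx[:, 0]] + (1 - alpha) * parents[idx[:, 1]]  # crossover
        mask = rng.random(children.shape) < mut_rate
        children = children + mask * rng.normal(0.0, 0.1, children.shape)         # mutation
        pop = normalize(children)
        pop[0] = parents[0]                                        # elitism: keep best-so-far
    fitness = np.array([mare(w, preds, obs) for w in pop])
    return pop[np.argmin(fitness)]

# preds: (4, n_samples) array of RNN/LSTM/GRU/TCN forecasts; obs: observed Chl-a values
# best_w = ga_weights(preds, obs)
```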

EM based on NSGA-II

NSGA-II, introduced by Deb et al. (2000), is a MOO algorithm that combines the principles of GAs with the concepts of dominance and CD. This optimization technique generates population members using crossover and mutation within the GA framework. These members are then categorized into different fronts based on their non-dominance. Subsequently, population members within each front are ranked using the CD operator. In this study, the decision variables within NSGA-II pertain to the weights assigned to different DL models. The optimization process revolves around two objective functions, as outlined below:

$${Z}_1=\frac{\sum_{j=1}^k\ \left|\frac{\sum_{i=1}^n{w}_i{y}_{ij}^{\ast }-{y}_j}{y_j}\right|}{k}\times 100$$
(3)
$${Z}_2=\frac{\sum_{j=1}^m\ \left|\frac{\sum_{i=1}^n{w}_i{x}_{ij}^{\ast }-{x}_j}{x_j}\right|}{m}\times 100$$
(4)

where Z1 and Z2 represent the MARE between the EM output and the observed values for the training and testing datasets, respectively. In accordance with the description of these two objective functions, NSGA-II is utilized to find weights that minimize both objective functions. For clarity, the other variables in Eqs. 1–4 are elaborated on hereunder:

\({y}_{ij}^{\ast }\):

The value forecasted by the ith model for the jth sample of the training data

yj:

The observed value of the jth sample of the training data

\({x}_{ij}^{\ast }\):

The value forecasted by the ith model for the jth sample of the testing data

xj:

The observed value of the jth sample of the testing data

k:

The number of training data samples

m:

The number of testing data samples

The structure of NSGA-II

MOO techniques, which address competing objectives, result in multiple optimal solutions, in contrast to SOO methods. One particularly effective technique is NSGA-II, a robust method for optimizing multiple objectives (Deb et al. 2002; Zamani et al. 2022; Zamani et al. 2023c). The process begins by creating an initial population of N individuals, followed by calculating the objective function values (Z1 to Zn). Subsequently, the initial population is evaluated and ranked based on the criteria of non-domination (ND) and crowding distance (CD). To enhance the overall response, the process employs two key operators, mutation and crossover, which are used to generate a mutation population (MP) and an offspring population (OP), respectively. The total population consists of three subpopulations: the parent population (P), the MP, and the OP (Ehsani et al. 2022). This process is iteratively repeated to determine the optimal solution. The flowchart in Fig. S6 illustrates this iterative process.
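
To make this procedure concrete, the following sketch shows how the two-objective weight search of Eqs. 3 and 4 could be set up with the open-source pymoo library. It is an illustrative alternative to the implementation actually used in this study; the candidate weights are simply renormalized to satisfy Eq. 2 rather than constrained explicitly, and the forecast/observation arrays are hypothetical placeholders:

```python
import numpy as np
from pymoo.core.problem import ElementwiseProblem
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.optimize import minimize

def mare(w, preds, obs):
    """MARE (%) of the weighted-average ensemble (Eqs. 3 and 4)."""
    return np.mean(np.abs((w @ preds - obs) / obs)) * 100

class EnsembleWeights(ElementwiseProblem):
    def __init__(self, train_preds, train_obs, test_preds, test_obs):
        super().__init__(n_var=4, n_obj=2, xl=0.0, xu=1.0)   # one weight per DL model
        self.train = (train_preds, train_obs)
        self.test = (test_preds, test_obs)

    def _evaluate(self, x, out, *args, **kwargs):
        w = x / (x.sum() + 1e-12)                            # renormalize so the weights sum to 1
        out["F"] = [mare(w, *self.train), mare(w, *self.test)]   # Z1 (training), Z2 (testing)

# train_preds/test_preds: (4, n_samples) forecasts of RNN, LSTM, GRU, TCN; *_obs: observed Chl-a
problem = EnsembleWeights(train_preds, train_obs, test_preds, test_obs)
result = minimize(problem, NSGA2(pop_size=100), ("n_gen", 100), seed=1, verbose=False)

X = np.atleast_2d(result.X)
pareto_weights = X / X.sum(axis=1, keepdims=True)   # Pareto-optimal weight vectors
pareto_objectives = result.F                        # corresponding (Z1, Z2) values
```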

Cross-validation approach for DL models

Cross-validation serves as a crucial approach for assessing a model's predictive capacity on new and unseen data, offering essential insights into its ability to generalize beyond the training set. This methodology provides an estimation of the model's potential performance when applied to unfamiliar data points (Zhou 2012). To mitigate the risk of overly optimistic performance estimates, a common strategy involves withholding a portion of the dataset from the training phase. Prior to commencing the training process, this subset of data is segregated and kept separate. Following the completion of the training phase, the withheld data are then utilized to assess the algorithms' proficiency. This foundational principle underlies a broad spectrum of model evaluation techniques collectively referred to as cross-validation.

The process of partitioning the dataset can influence the outcomes of machine learning algorithms. Numerous methods have been proposed in previous research to implement the concept of cross-validation; nevertheless, these methodologies all share the same fundamental elements (Wang et al. 2023). Among the various available techniques, this study opted for the hold-out procedure due to its straightforward and uncomplicated nature. Figure 2 visually presents the cross-validation approach employed in this study.
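
As a minimal illustration of the hold-out procedure for a time series, the split can be performed without shuffling so that the chronological order is preserved; X and y below stand for hypothetical prepared input and target arrays, and the 75/25 proportion matches the split used later in this study:

```python
from sklearn.model_selection import train_test_split

# Hold-out split for time-series data: shuffle=False keeps the chronological order,
# so the test set corresponds to the most recent portion of the record
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, shuffle=False)
```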

Fig. 2

The cross-validation (CV) methodology utilized in this study

Model performance evaluation

This study evaluated the forecasting accuracy using various statistical metrics, including MSE, RMSE, MAE, RMAE, R2, NSE, and MARE (%). The MSE and its variant, RMSE, delve into the squared differences between forecasted and actual values, effectively capturing both the magnitude and direction of errors. They range from 0 to ∞, with lower values indicating closer alignment between forecasted and observed values (Uddin et al. 2023a, 2023b, 2023c). These metrics emphasize the significance of larger errors due to the squaring process, making them particularly sensitive to outliers. While providing a comprehensive view of overall forecasting accuracy, the squared nature of these metrics might obscure the understanding of the directional bias of errors. Consequently, they might not readily distinguish between underestimation and overestimation tendencies. The RMAE introduces a dimension of normalization, allowing for comparisons relative to the scale of the observed data, thereby aiding in discerning proportionate inaccuracies (Hajikarimi et al. 2022). The R2 metric operates on a scale from 0 to 1. It measures the extent to which the model's forecasts can account for the variability in the recorded data. A value nearing 1 indicates a stronger alignment of the model with the data, highlighting a greater level of concurrence between forecasted and recorded values. However, R2 does not explicitly indicate the direction of errors, focusing more on the degree of fit. The NSE measures how well the model's forecasts outperform a reference (usually the mean of the observed values). Spanning from negative infinity to 1, values approaching 1 signify enhanced performance. Negative values indicate that the mean of the recorded values serves as a more effective predictor than the model, potentially implying overestimation tendencies (Dargi et al. 2023; Ghadermazi et al. 2022). The MARE, expressed as a percentage, provides a relative insight into the accuracy of forecasts in relation to the actual values. These statistical indices are defined below.

$$MSE=\frac{1}{n}\sum_{i=1}^n{\left({Chl}_F^i-{Chl}_O^i\right)}^2$$
(5)
$$RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^n{\left({Chl}_F^i-{Chl}_O^i\right)}^2}$$
(6)
$$MAE=\frac{1}{n}\sum_{i=1}^n\left|{Chl}_F^i-{Chl}_O^i\right|$$
(7)
$$RMAE=\frac{1}{n}\sum_{i=1}^n{\left|{Chl}_F^i-{Chl}_O^i\right|}^2$$
(8)
$${R}^2=\frac{{\left[\sum_{i=1}^n\left({Chl}_O^i-{\overline{Chl}}_O\right)\left({Chl}_F^i-{\overline{Chl}}_F\right)\right]}^2}{\sum_{i=1}^n{\left({Chl}_O^i-{\overline{Chl}}_O\right)}^2\sum_{i=1}^n{\left({Chl}_F^i-{\overline{Chl}}_F\right)}^2}$$
(9)
$$NSE=1-\frac{\sum_{i=1}^n{\left({Chl}_O^i-{Chl}_F^i\right)}^2}{\sum_{i=1}^n{\left({Chl}_O^i-{\overline{Chl}}_O\right)}^2}$$
(10)
$$MARE=\frac{1}{n}\sum_{i=1}^n\left|\frac{Chl_O^i-{Chl}_F^i}{Chl_O^i}\right|\times 100$$
(11)

In Eqs. 5–11, ChlF and \({\overline{Chl}}_F\) are the forecasted Chl-a values and their average, respectively. ChlO and \({\overline{Chl}}_O\) are the observed Chl-a values and their corresponding average, while n denotes the total number of samples.
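
A compact Python sketch of these indices (RMAE omitted) is given below; it assumes the observed and forecasted Chl-a values are provided as equal-length NumPy arrays:

```python
import numpy as np

def forecast_metrics(obs: np.ndarray, pred: np.ndarray) -> dict:
    """Evaluation indices of Eqs. 5-7 and 9-11 for observed vs. forecasted Chl-a."""
    err = pred - obs
    mse = np.mean(err ** 2)
    return {
        "MSE": mse,
        "RMSE": np.sqrt(mse),
        "MAE": np.mean(np.abs(err)),
        "R2": np.corrcoef(obs, pred)[0, 1] ** 2,                        # squared Pearson correlation
        "NSE": 1.0 - np.sum(err ** 2) / np.sum((obs - obs.mean()) ** 2),
        "MARE%": np.mean(np.abs(err / obs)) * 100,
    }

# Illustrative call with made-up values:
# forecast_metrics(np.array([2.1, 2.5, 3.0]), np.array([2.0, 2.7, 2.9]))
```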

The study also employed the Taylor diagram to evaluate model accuracy. This visual tool provides a concise representation of how well model outputs align with observed data across multiple metrics, capturing accuracy, precision, and bias in a single plot. It summarizes model behavior through three related statistics: the correlation coefficient, the standard deviation ratio, and the centered root-mean-square difference. The reference (observed) data appear as a fixed point on the diagram, and each model's forecasts are plotted relative to it: the azimuthal position reflects the correlation with the observations, the radial distance from the origin reflects the normalized standard deviation, and the distance from the reference point reflects the centered root-mean-square difference. These insights help identify models that excel in specific dimensions and support trade-off decisions. The Taylor diagram therefore complements the quantitative metrics, holistically assessing performance, aiding model selection, and enhancing our understanding of model behavior relative to the observed data.

Case study

Study area

Positioned in northwestern Greece near the borders of Albania and North Macedonia, Small Prespa (Mikri Prespa) is one of two interconnected lakes; it covers approximately 47 km2 and is situated at an elevation of 853 m above sea level (Tziritis 2014). The lake primarily replenishes itself through surface runoff originating from various water bodies, as well as lateral sub-surface flow resulting from interconnected aquifer systems within the region. The climate in the area is continental, characterized by hot summers and chilly winters. The landscape transforms into lush greenery during the spring months of March to May, with average temperatures ranging from 10 to 20 °C and increased rainfall. The lake receives a substantial amount of precipitation throughout the year, with the highest rainfall occurring from December to February; average monthly precipitation during this period ranges from 120 to 140 mm. In addition, snowfall is common during this time of the year (Fijani et al. 2019).

The lake's water quality is compromised by pollution and human activities. The concentration of Chl-a, the primary pigment for photosynthetic activity in most algae and cyanobacteria, serves as a key indicator of phytoplankton abundance in aquatic environments. Monitoring Chl-a concentrations in Small Prespa provides valuable insights into the lake's productivity, overall health, and the potential risk of harmful algal blooms. This information is essential for assessing the lake's ecological well-being and taking necessary measures to preserve water quality and protect the health of the surrounding ecosystem (Van der Schriek 2020). Figure 3 illustrates the location of Small Prespa in Greece.

Fig. 3

The study area, situated in Greece (adapted from google.com/maps)

Data source

During the period from June 1, 2012, to May 31, 2013, WQVs were obtained at 15-min intervals. These data encompassed an array of parameters intricately tied to the chemical and physical attributes of the water, including EC, ORP, pH, water temperature, DO, and Chl-a concentration. The measurements were captured using a sensor equipped with multiple probes, each contributing to a comprehensive understanding of water quality dynamics. The sensor was strategically located along the northern shoreline of the lake, 1.5 m below the water's surface at a site approximately 3 m deep, and this placement was maintained consistently throughout the entire year. By carefully selecting the installation location, the acquired data represented the holistic aspects of the lake's water quality. Moreover, ensuring the equipment's safety was of paramount importance, safeguarding against any potential external influences that could disrupt or damage the equipment. The accuracy of the sensor measurements was ensured by rigorous calibration procedures prior to deployment. In this process, sensor readings were compared with standards or reference measurements, enabling potential systematic errors or biases to be identified and corrected. The sensor was also calibrated regularly during the study period to ensure its accuracy and reliability. Verifying the accuracy of the sensor measurements required the comparison of sensor data with measurements derived from well-established laboratory methods: water samples were collected and analyzed thoroughly using trusted techniques to establish a benchmark against which the precision of the sensor's measurements could be judged. By following these stringent criteria and adhering to a rigorous method, the data collection process was designed to mitigate the impact of potential errors and biases. Through this comprehensive approach, the reliability and accuracy of the data were strengthened, providing a substantial foundation upon which to conduct subsequent analyses and modeling efforts. Table 1 provides a statistical overview of the measured WQVs. Specifically, the Chl-a concentration spans a range from 0.82 to 16.97 μg/L, with a corresponding mean value of 2.66 μg/L. To forecast Chl-a concentration on an hourly basis, the dataset was transformed into hourly time steps, resulting in a reduction of data points from 34,825 to 8706. The transformation was performed by resampling the data in Google Colab.
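
For illustration, the aggregation from 15-min records to hourly time steps can be done with pandas as sketched below; the file name and column labels are hypothetical placeholders rather than the actual data files of this study:

```python
import pandas as pd

# Hypothetical raw file holding the 15-min multi-probe records with a timestamp column
raw = pd.read_csv("small_prespa_wq.csv", parse_dates=["timestamp"], index_col="timestamp")

# Resample the 15-min records to hourly means (roughly 34,825 rows -> 8,706 rows for one year)
hourly = raw.resample("1H").mean()
chla_hourly = hourly["Chl-a"].interpolate()   # fill occasional gaps left by missing records
```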

Table 1 A summary of statistical measures for the different WQVs in SPL

Figure 4(a–b) illustrates the temporal variations of the WQVs collected in the SPL, highlighting fluctuations in Chl-a, DO, pH, water temperature, EC, and ORP. Notably, water temperature, EC, and ORP in the study area exhibited significant fluctuations over the study period. In contrast, the pH value remained stable throughout this time. Additionally, the concentration of Chl-a peaked on October 7, 2012, reaching its maximum value, while it decreased to nearly 1 μg/L by the end of the water quality sampling period.

Fig. 4

Temporal variations of WQVs (a) Chl-a, DO, pH, and water temperature and (b) EC and ORP

The correlation plot of all measured WQVs is presented in Fig. 5. There is no strong correlation between Chl-a concentrations and the other measured WQVs. This correlation analysis contributes to a deeper understanding of the complex interactions governing aquatic environments; it reveals how different parameters are interconnected and offers insights into potential cause-and-effect relationships (Zhou et al. 2016; Ly et al. 2018; Bui et al. 2020). Regarding the age of the data, no more recent water quality data for Small Prespa Lake have been made available. Although up-to-date data would be preferable for the study area, the main purpose of our research is to present a framework capable of forecasting a water quality variable, namely Chl-a, using weighted averaging approaches and a time-series water quality dataset. It is worth mentioning that various recent studies have used the same water quality data (Tziritis 2014; Fijani et al. 2019; Barzegar et al. 2020), and one of the main goals of the present study is to compare its results with those of these studies.

Fig. 5

Correlation plot among WQVs

Model development

We established four DL models (RNN, LSTM, GRU, and TCN) individually, along with their combination through the GA and NSGA-II ensemble methods. These ensembling techniques were applied to forecast Chl-a concentrations and enhance the predictive outcomes of the individual DL models. Accordingly, the hourly Chl-a dataset was separated into two portions: 75% of the data was allocated for the training phase, while the remaining 25% was designated for the testing phase. Furthermore, the models were implemented on a personal computer operating with the Windows 10 operating system. The computer was equipped with an Intel(R) Core(TM) i7-10750H processor running at 2.60 GHz and 16 GB of RAM. For development, Python 3.9.7 was used, and the DL models were built using the Google Colab IDE and the Keras development framework.

DL models

DL models were developed for forecasting Chl-a at the next time step (i.e., Chl-a (t+1)) in Small Prespa Lake using Chl-a at different lag times as inputs. Within the scope of this study, univariate forecasting was employed as the chosen methodology. Univariate forecasting using lag times offers certain advantages over multivariate forecasting in specific contexts: (1) simplicity and ease of implementation: univariate forecasting focuses on forecasting a single variable's future values based solely on its past values, an approach that is often simpler to implement than multivariate methods that require handling multiple variables and their potential interactions; (2) reduced complexity: univariate models involve fewer variables, which makes them easier to interpret and less demanding of computational resources; (3) less data preprocessing: univariate models require only the historical data of the variable being forecasted, reducing the need for the extensive data preprocessing and alignment that is often necessary in multivariate methods; and (4) data availability: univariate forecasting can be advantageous when historical data for other correlated variables are limited or unavailable. In the context of our study, there is no substantial correlation between Chl-a and the other measured WQVs. This observation underscores the complexity of the relationships governing these variables, implying that univariate forecasting is a valuable avenue for forecasting Chl-a in this specific scenario. To develop the models, the entire Chl-a dataset was first divided into distinct training and testing periods. The training data served a dual purpose: validating and comparing the effectiveness of the individual models developed over the training period. In Python, the scikit-learn library was used to scale the data to the range 0 to 1 through min-max scaling. This process helps mitigate abrupt changes in gradients, leading to smoother convergence during the training of DL models. During the training of the DL models, each configuration was iteratively refined through trial and error, ultimately selecting the model with the lowest RMSE. Here, lag times ranging from 1 to 6 h were considered: the input variables included Chl-a (t), Chl-a (t-1), ..., Chl-a (t-6), and the forecast target was the one-step-ahead Chl-a concentration, Chl-a (t+1).
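
Continuing from the hypothetical hourly series sketched earlier, the following shows one way to frame the univariate problem with lagged inputs and min-max scaling; the function and variable names are illustrative, and the scaler is fitted on the training portion only to avoid information leakage:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def make_lagged_dataset(series: pd.Series, n_lags: int = 6):
    """Frame the series as supervised learning: Chl-a(t), ..., Chl-a(t-n_lags) -> Chl-a(t+1)."""
    frame = pd.DataFrame({f"lag_{k}": series.shift(k) for k in range(n_lags + 1)})
    frame["target"] = series.shift(-1)                     # one-step-ahead Chl-a
    frame = frame.dropna()
    return frame.drop(columns="target").values, frame["target"].values

X, y = make_lagged_dataset(chla_hourly, n_lags=6)

split = int(0.75 * len(X))                                 # 75/25 chronological hold-out
scaler = MinMaxScaler(feature_range=(0, 1))                # min-max scaling to [0, 1]
X_train = scaler.fit_transform(X[:split])
X_test = scaler.transform(X[split:])
y_train, y_test = y[:split], y[split:]

# Keras recurrent and convolutional layers expect 3-D input: (samples, time steps, features)
X_train = X_train.reshape(-1, X_train.shape[1], 1)
X_test = X_test.reshape(-1, X_test.shape[1], 1)
```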

To find the best-performing DL model, various model structures were trained. Models with a one-layer input were selected for modeling because they exhibited the lowest RMSE during the training period, outperforming more complex structures. An essential goal of this research is to comprehensively compare DL models alongside other approaches. For the LSTM, GRU, and RNN models, the parameters of each layer were approximately consistent, comprising 50 neurons with ReLU as the activation function. The return sequences parameter, which pertains to the return of the hidden state a<t>, was set to true due to the time-dependent nature of the data. Different activation functions were tested for all models, including eLU, Tanh, Softmax, SeLU, Softplus, and ReLU. Among these, the ReLU function yielded the lowest RMSE, demonstrating its suitability for analyzing the time series data. At each training step, the dropout layer introduces randomness by setting input units to zero at a specified rate, helping to mitigate the issue of overfitting. The remaining inputs, those not set to zero, are scaled by a factor of \(\frac{1}{\left(1- rate\right)}\) to maintain the overall sum of all inputs, where the rate is the proportion of inputs that are set to 0. For example, if 20% of the inputs are set to 0, the rate is 0.2 and the scaling factor is 1.25. The objective of this study is to evaluate and compare the outcomes obtained from individual DL models with those of the EMs. To ensure an equitable comparison, a uniform dropout rate of 0.001 was applied to all DL models. This value was chosen due to its superiority over other values, as it effectively mitigated overfitting, enhanced generalization, and expedited the convergence of the DL models (Barzegar et al. 2021). Furthermore, the optimization of the other parameters of the DL models was performed using a trial-and-error approach. For instance, in the LSTM model, multiple layers were utilized, with the input layer incorporating lag times of Chl-a as inputs to each subsequent stage. To construct this model, an LSTM-based hidden layer was employed, utilizing a ReLU activation function. The hidden layers comprised 64 and 32 units dedicated specifically to Chl-a. After the LSTM layer, a dropout layer with a rate of 0.001 was added to help reduce the degree of overfitting. Subsequently, a fully connected layer known as "Dense" was introduced. This model was developed using the "Adam" optimizer and the "MSE" loss function. The "Adam" optimizer is a gradient-based algorithm that adapts the learning rates dynamically. Additionally, a learning rate of 0.01 was employed in the optimization process. Notably, the number of epochs was set to 100, and a validation split of 0.15 was utilized. Figure 6 illustrates the loss functions of the developed DL algorithms, namely LSTM and GRU. In this representation, the x-axis corresponds to the training iterations or epochs, while the y-axis denotes the loss function's values. Notably, the convergence observed in these graphs indicates the models' approach towards optimal performance. Both the LSTM and GRU models exhibited commendable predictive performance in forecasting Chl-a concentration. Furthermore, the consistent decrease in loss values across both the training and testing sets attests to the models' suitability, indicating a well-fitted and well-generalized behavior for both algorithms.
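
A minimal Keras sketch of an LSTM model using the settings reported above (64- and 32-unit LSTM layers with ReLU activations, a dropout rate of 0.001, the Adam optimizer with a learning rate of 0.01, the MSE loss, 100 epochs, and a validation split of 0.15) is shown below; the exact layer ordering and the data arrays are assumptions for illustration rather than the authors' exact implementation:

```python
from tensorflow.keras import Sequential, layers, optimizers

model = Sequential([
    layers.Input(shape=(X_train.shape[1], 1)),                   # lagged Chl-a inputs
    layers.LSTM(64, activation="relu", return_sequences=True),   # return_sequences feeds the next LSTM layer
    layers.Dropout(0.001),
    layers.LSTM(32, activation="relu"),
    layers.Dropout(0.001),
    layers.Dense(1),                                              # Chl-a (t+1)
])
model.compile(optimizer=optimizers.Adam(learning_rate=0.01), loss="mse")
history = model.fit(X_train, y_train, epochs=100, validation_split=0.15, verbose=0)
```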

Fig. 6

Loss function of the DL models for Chl-a forecasting

One of the principal aims of this study is to establish a Chl-a forecasting model using a TCN in order to thoroughly investigate the capabilities of this model. Given the inherent complexity of DL models, the task of identifying the optimal TCN network structure and hyperparameters holds significant importance. The candidate values for the number of residual blocks, the kernel size, and the number of filters were {1, 2, 3}, {4, 8, 8}, and {32, 64, 128}, respectively. Furthermore, the input data size was set to six, indicating that a combined set of six time intervals, comprising both present and past data, was utilized for forecasting future values. Subsequently, all structures were explored, considering various permutations of the following parameters: batch size options of {32, 64, 128} and epoch values of {20, 50, 100}. In total, 243 experiments were conducted on the dataset as part of this research endeavor. The objective was to pinpoint the optimal network structure and hyperparameter values across diverse scenarios. Table 2 provides an in-depth summary of the optimal parameters for the trained DL algorithms. Configuring the hyperparameters through a trial-and-error procedure brings forth a multitude of distinct advantages: the fine-tuning of models to align with the domain-specific characteristics of the dataset, the maintenance of controlled model complexity to mitigate overfitting risks, the efficient utilization of computational resources compared to automated techniques, the harnessing of domain expertise for informed parameter choices, and adeptness at adapting to scenarios with limited data to thwart overfitting.
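
The exhaustive search over these candidate values can be organized as a simple grid, as sketched below; build_tcn refers to a TCN constructor such as the one sketched earlier, the selection criterion follows the training-period RMSE used in this study, and all helper and array names are illustrative:

```python
import itertools
import numpy as np

residual_blocks = [1, 2, 3]
kernel_sizes = [4, 8, 8]          # candidate values as listed above
n_filters = [32, 64, 128]
batch_sizes = [32, 64, 128]
epoch_options = [20, 50, 100]

results = []
for blocks, k, f, bs, ep in itertools.product(
        residual_blocks, kernel_sizes, n_filters, batch_sizes, epoch_options):
    model = build_tcn(input_len=X_train.shape[1], filters=f, kernel_size=k, n_blocks=blocks)
    model.compile(optimizer="adam", loss="mse")
    model.fit(X_train, y_train, batch_size=bs, epochs=ep, verbose=0)
    # Training-period RMSE as the selection criterion
    rmse = float(np.sqrt(np.mean((model.predict(X_train, verbose=0).ravel() - y_train) ** 2)))
    results.append(((blocks, k, f, bs, ep), rmse))

best_config, best_rmse = min(results, key=lambda r: r[1])   # 3^5 = 243 combinations in total
```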

Table 2 The summary of DL algorithms in the current study

Ensemble-based singular algorithms

The ensembling of the DL models using GA and NSGA-II was carried out to determine the optimal values of the EM parameters. In this approach, both the GA and NSGA-II algorithms were utilized to discover the optimal combination of weights or coefficients for the individual DL models. For the GA, a MATLAB-based optimization method inspired by natural selection was developed. It involves generating a population of candidate solutions and then iteratively evolving them using principles from genetics and evolution. The objective function of the GA was the evaluation metric MARE, which was minimized. Table 3 presents the parameters of the GA along with their optimal values. Additionally, Fig. 7(a) illustrates the optimal solutions achieved for forecasting Chl-a using the GA. This result emerged when the fitness value and the best fitness exhibited equal values under specific circumstances.

Table 3 The optimal parameter values for the GA and NSGA-II algorithms
Fig. 7

The EMs' results: a the MARE between the forecasted and observed Chl-a values during the training and testing periods and b the trade-off between the two considered objective functions

The optimization process of NSGA-II entails generating a set of candidate solutions, followed by evaluating the optimization criteria for each individual solution. From this set, the non-dominated (ND) solutions are selected. The solutions chosen through this algorithm constitute the optimal Pareto front, offering a spectrum of trade-off solutions that balance multiple objectives.

As the objective functions of NSGA-II, the MARE between the EM output and the observed values was minimized for the training and testing data, respectively (Eqs. 3 and 4). Figure 7(b) illustrates the trade-off between the two objective functions pertaining to the EMs designed for Chl-a forecasting. In Fig. 7(b), the presentation showcases individuals within the population found on a particular non-dominated front, where none of the members exhibits dominance over the others. Within the NSGA-II model, a population of 100 members evolved over 100 generations was adopted, and five members were ultimately selected from the final front and ranked based on the CD criterion. The utilization of the CD criterion serves as a pivotal aspect of the model's design, aimed at augmenting the diversity and representation of non-dominated solutions spanning the Pareto front. This strategic approach bolsters the robustness and efficacy of the NSGA-II algorithm, especially when tackling complex MOO problems. By prioritizing a well-distributed set of solutions that showcase Pareto optimality, the NSGA-II algorithm achieves a refined equilibrium between competing objectives, resulting in optimal outcomes that resonate with the inherent intricacies of real-world scenarios. Table 3 presents the parameters of NSGA-II and their corresponding optimal values. Additionally, Table 4 displays the five optimal solutions generated by NSGA-II, indicating the optimal weights for each DL model and the optimal values of the objective functions.

Table 4 The five optimal solutions acquired through NSGA-II

Results

DL models were developed based on hourly time steps of Chl-a data in the SPL, located in Greece. The test data were employed to evaluate the performance of each individual DL model and the EMs. Table 5 provides a comparative analysis of the evaluation metrics for the developed DL models, as well as the model ensembles, for both the training and testing periods. Additionally, Figs. 8 and 9 provide graphical comparisons of some of these evaluation metrics.

Table 5 Performance of the developed DL (RNN, LSTM, GRU, and TCN) and ensemble-DL models for Chl-a forecasting

Figure 8 illustrates the comparison of the evaluation indices for the models developed throughout the training and testing periods. Considering the R2 metric, the EM exhibited the highest accuracy in comparison to the individual DL models. With the R2 metric ranging from 0 to 1, these findings indicate that the EM outperformed the other models in terms of accuracy; specifically, the differences amounted to 14%, 6%, 4%, and 7% for RNN, LSTM, GRU, and TCN, respectively. During the testing period, the Nash-Sutcliffe efficiency (NSE) values ranged from 0.72 to 0.84, classifying the performance as good (0.65 ≤ NSE ≤ 0.75) to very good (0.75 < NSE ≤ 1.00) according to the classification by Moriasi et al. (2007). This assessment pertains to the forecasting of Chl-a using both the individual models and the EMs. Better model performance is indicated when the RMSE is closer to 0. Across both the training and testing phases, the EM consistently outperformed the individual DL models. Further analysis of this index, considering the nature of the individual DL models, revealed that the RNN had the weakest performance due to challenges related to gradient vanishing and exploding. In contrast, the development of the EM capitalized on the strengths of each individual model, resulting in the lowest RMSE.

Fig. 8

The assessment of the models’ accuracy — R-squared as well as NSE — for DL and EMs

All the models assessed in Table 5 exhibit remarkably effective performance, evident from their high R2 values, which are close to 1. Notably, EM-NSGA-II stands out with exceptional effectiveness compared to DL and EM-GA models, showcasing improvements of 14% (RNN), 8% (LSTM), 6% (GRU), 8% (TCN), and 3% (EM-GA) during the testing phase.

Figure 9 illustrates the comparison of the error metrics among the developed models. From both Fig. 9 and Table 5, it is evident that the RMSE, MAE, and MSE values for the models are all below 0.3, signifying highly accurate models for forecasting Chl-a concentration. Regarding model performance assessed through errors, the results indicate that most of the DL and EM models demonstrated similar performance in forecasting Chl-a concentration during both the training and testing periods. However, the EM-NSGA-II model distinctly stood out as the top performer (Fig. 9).

Fig. 9

The assessment of the models’ accuracy — MSE, RMSE, and MAE — for DL algorithms and EMs

In Fig. 10, the spider plot visually represents the performance metrics employed in the present study. These diagrams pertain to the evaluation of the developed DL models and EMs during the testing phase. Notably, based on this analysis, it is evident that the EM-NSGA-II algorithm consistently outperformed the other developed models, which include RNN, LSTM, GRU, TCN, and the GA-based EM, across a spectrum of evaluation criteria; this superior trend was pervasive across nearly all metrics. Conversely, the RNN model exhibited the least accuracy in forecasting Chl-a concentration, as evidenced by its highest values for MSE, MAE, and RMSE, coupled with the lowest values for NSE and R2. As for the other individual DL models and the EM-GA model, their efficacy for Chl-a forecasting is generally satisfactory, with the EM-GA model demonstrating a marginally superior performance compared to the individual DL models.

Fig. 10

Spider diagram illustrating the model’s performance in Chl-a forecasting

The time series plots in Fig. 11(a) depict the forecasted versus observed Chl-a concentrations during the TSP. The plots provide a visual comparison of the actual and estimated Chl-a values for the four DL models (i.e., RNN, LSTM, GRU, and TCN) and the two EMs (i.e., EM-GA and EM-NSGA-II). Among the individual DL models, GRU is better suited to capturing the low and high concentrations of the observed Chl-a compared to RNN and LSTM, likely due to its specialized gating mechanism.

Fig. 11

The hydrographs of actual and estimated values during the TSP for DL models, including RNN, LSTM, GRU, and EM-NSGA-II as EMs

The scatter plots in Fig. 11(b) relate the forecasted to the observed Chl-a concentrations, together with the corresponding R2 values, and reveal the models' tendencies toward under- and overestimation. In these scatter plots, points positioned above the line of equality indicate that the forecasted Chl-a concentration is lower than the actual Chl-a concentration. The proximity of points to the line of equality directly corresponds to the quality of the forecasting performance. The high similarity between the actual and estimated Chl-a concentrations, coupled with the majority of points positioned close to the line of equality, signifies that the GRU model outperforms the other DL models in terms of the R2 metric.

The study demonstrates that EMs perform better than individual DL models in forecasting Chl-a concentrations. The comparison of time series plots between observed and forecasted Chl-a levels shows that EMs, in general, are more effective in handling complex datasets and producing accurate forecasts. The EM-NSGA-II, which uses two objective functions, was found to be more accurate than the EM-GA in forecasting both low and high concentrations of Chl-a. However, the NSGA-II-based EM required more training time than the GA-based EM. Despite this, the increased accuracy of the NSGA-II-based EM highlights the importance of using multiple objective functions in EMs to enhance their forecasting efficiency. Overall, the study suggests that the combination of DL models through EMs provides a powerful approach for forecasting the Chl-a concentration. Furthermore, the use of optimization techniques such as NSGA-II and GA can further enhance the accuracy and robustness of EMs.

To evaluate the effectiveness of the models, a Taylor diagram (Fig. 12) was employed. This diagram offers a visual representation for comparing models and determining their accuracy. It illustrates how well a model's forecasts align with observed data and aids in identifying the most realistic model. As per the diagram, the EM-NSGA-II model exhibited the best results for Chl-a forecasting. Upon examination, the EM-NSGA-II model displayed a strong positive relationship, as evidenced by a correlation coefficient of 0.98, along with a normalized standard deviation of 0.93. These values indicate a robust correlation and a close match between actual and forecasted Chl-a levels. This substantiates the EM-NSGA-II model as the most accurate and realistic representation of Chl-a forecasting among the tested models. Conversely, the RNN model yielded the least favorable results, with a correlation coefficient of 0.91, indicating a substantial positive association, and a normalized standard deviation of 1.06, reflecting moderate variability. These metrics suggest a lower correlation and greater variability between observed and forecasted Chl-a concentrations. The relatively poorer performance of the RNN model suggests that it might not be the most suitable choice for Chl-a concentration forecasting.

Fig. 12

Graph illustrating the performance of the models in estimating Chl-a using the Taylor diagram

Discussions

While the developed DL models revealed that the RNN exhibited relatively weaker performance compared to the other individual DL models in Chl-a forecasting, it is important to note that the RNN model still demonstrated acceptable accuracy. This can be attributed to the RNN's inherent capability to retain information from past inputs and incorporate it when processing new inputs, making it suitable for time series analysis. However, the RNN model encounters challenges when accurately forecasting high values of Chl-a. This limitation can be attributed to the gradient vanishing problem that arises during the backpropagation process. When gradients become exceedingly small, they tend to vanish, resulting in prolonged training periods and suboptimal performance. Conversely, gradient exploding occurs when gradients become overly large during backpropagation, causing weights to update too rapidly and introducing instability during training. Despite these challenges, the RNN model's capacity for temporal memory and sequence processing underscores its viability for certain aspects of Chl-a forecasting.

The LSTM model, a variant of the RNN, stands out as a powerful tool for forecasting time series data and, given the parameters used here, proves to be a valuable choice for forecasting upcoming Chl-a levels. In contrast to the basic RNN, the LSTM offers several advantages: it can handle dependencies over extended timeframes, it is resilient against the vanishing gradient issue, and it is flexible with respect to input sequence length. Consequently, the LSTM model tends to deliver superior performance compared to the RNN, particularly when forecasting high Chl-a values. Furthermore, the accuracy of the GRU model surpasses the results achieved by both the RNN and the LSTM. The primary distinction between the LSTM and the GRU lies in their architecture: while the LSTM employs three gates to regulate information flow and a cell state to store information, the GRU uses reset and update gates together with a hidden state that acts as the memory unit. The simplicity and reduced parameter count of the GRU make it faster to train and more suitable for smaller datasets, whereas the LSTM has demonstrated its strength in tasks that require handling long-term dependencies; the choice between the two therefore hinges on the specific task and dataset characteristics.

Differentiating itself from the other DL models developed in this study, the TCN leverages temporal convolutions for sequential data processing, diverging from the recurrent connections employed by the other models. TCNs employ convolutional layers to capture higher-level temporal dependencies by convolving the input sequence with a set of filters in a sliding-window manner.

EMs, which combine the predictive outputs of the RNN, LSTM, GRU, and TCN, stand as a robust approach for enhancing Chl-a concentration forecasts. This ensemble of models leads to more accurate Chl-a forecasting by leveraging the strengths of each constituent model. The use of EMs in the present study outperformed the individual DL models for several reasons:

1. Reducing model bias and variance: Individual models may suffer from bias or variance, which can limit their accuracy. By combining the forecast values of multiple models, both bias and variance can be reduced, leading to more accurate forecasts. This is why, in the present study, EM-GA and EM-NSGA-II performed better than the individual models.

2. Capturing diverse perspectives: Individual models have different strengths and weaknesses, and combining models with diverse perspectives allows an EM to capture a wider range of information and produce more accurate forecasts. For instance, while the RNN is an acceptable model for time series analysis, it struggles to forecast high Chl-a concentrations because of vanishing and exploding gradients, which hinder its ability to learn and converge to a solution. LSTM and GRU models address this problem through gated cells that selectively keep or discard information as time progresses; LSTMs are recognized for their capacity to capture extended temporal relationships, whereas GRUs have a simpler architecture that makes them faster to train and less prone to overfitting. The EMs used in this study combine the strengths of RNNs, LSTMs, GRUs, and TCNs to produce more accurate forecasts of Chl-a concentrations (see the sketch after this list).

3. Handling complex data: Some datasets may be too complex for a single model to capture all the nuances. An EM can leverage the strengths of multiple models to handle complex data and produce more accurate forecasts.

4. Robustness: Ensemble models tend to be more robust to noise or outliers in the data, as the errors of the individual models can partially cancel each other out. This leads to more stable and reliable forecasts.
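As an illustration of how such an ensemble can be assembled, the minimal sketch below combines the validation forecasts of the four members through a weighted average whose weights are tuned by an evolutionary optimizer. It is illustrative only: the synthetic data, the weight bounds, the single RMSE objective, and the use of SciPy's differential evolution as a stand-in for the GA are assumptions, not the exact GA/NSGA-II formulation applied in this study.

import numpy as np
from scipy.optimize import differential_evolution  # stand-in for the GA; NSGA-II would handle two objectives

def ensemble_forecast(weights, member_preds):
    """Weighted average of the member forecasts (RNN, LSTM, GRU, TCN)."""
    w = np.asarray(weights)
    w = w / w.sum()                          # normalize so the weights sum to 1
    return np.tensordot(w, member_preds, axes=1)

def rmse(weights, member_preds, observed):
    """Single objective: RMSE of the combined forecast on the validation set."""
    combined = ensemble_forecast(weights, member_preds)
    return float(np.sqrt(np.mean((combined - observed) ** 2)))

# Hypothetical validation data: member_preds has shape (4, n_samples), observed has shape (n_samples,)
rng = np.random.default_rng(0)
observed = rng.random(100)
member_preds = observed + 0.1 * rng.standard_normal((4, 100))

result = differential_evolution(
    rmse, bounds=[(1e-3, 1.0)] * 4, args=(member_preds, observed), seed=0
)
best_weights = result.x / result.x.sum()
print("ensemble weights:", best_weights, "validation RMSE:", result.fun)

An NSGA-II-style formulation would optimize two objectives simultaneously (for example, separate errors on low and high Chl-a ranges) and select a compromise solution from the resulting Pareto front.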

The findings of the current study, which highlight the superiority of EMs over individual DL models in forecasting Chl-a concentrations, align with several existing studies in the literature. For instance, Barzegar et al. (2020) showcased the enhanced performance of a model combining LSTM and CNN when predicting Chl-a and DO, outperforming individual CNN and LSTM models; interestingly, the EMs developed in the current study exhibited even better Chl-a forecasting performance than the CNN-LSTM combination proposed by Barzegar et al. (2020). Similarly, the investigation conducted by Barzegar et al. (2018) indicated that ML models employing wavelet functions yielded improved forecasts for EC compared to individual models; nonetheless, the EMs developed in the current study surpassed the accuracy achieved even by the combination of ML models with wavelet transforms. Another study, by Gao et al. (2020), concluded that their developed EMs outperformed a hybrid model, showcasing the advantages of ensemble strategies. Furthermore, Song et al. (2023) integrated metaheuristic optimization algorithms with LSTM to optimize DO forecasts, with their results demonstrating the superior performance of EMs based on GA and NSGA-II. In yet another context, Wu and Wang (2022) presented a fusion model incorporating an ANN, the discrete wavelet transform, and LSTM to forecast DO, yielding improved results compared to individual models; remarkably, our EMs for Chl-a concentration forecasting delivered even higher accuracy in terms of the evaluation indices, further emphasizing the strength of the proposed ensemble approach. Uddin et al. (2023a, 2023b, 2023c) introduced the Irish Water Quality Index (IEWQI) framework, developed to assess the quality of transitional and coastal waters, with the primary goal of enhancing the methodology and establishing a tool relevant to environmental regulators aiming to tackle water pollution; their findings underscored the promising effectiveness and reliability of this index as a more accurate means of evaluating the quality of transitional and coastal waters. Finally, Chen et al. (2023b) developed a hybrid algorithm for forecasting WQVs in rivers that outperformed individual models; it is worth mentioning that, considering the evaluation indices, the EMs developed in the current research led to better results than their model.

The application of the developed models in water resource management

DL models and EMs show great potential for forecasting WQVs, including Chl-a. Chl-a serves as a widely employed measure of water quality because it is a proxy for the abundance of phytoplankton in the water, which can impact aquatic ecosystems and human health. Traditional approaches to Chl-a forecasting involve statistical or physics-based models that require a large number of input variables, including meteorological and hydrological data as well as other WQVs. DL models and EMs, however, can provide a more accurate and efficient alternative by leveraging the power of neural networks to learn complex relationships within the data.

In addition, EMs can improve the accuracy of Chl-a forecasting by combining data from multiple sources, such as satellite images, in situ measurements, and environmental parameters. For example, an EM can combine satellite images of the water surface with in situ measurements of Chl-a concentrations to provide more accurate and comprehensive forecasts. A key benefit of DL models and EMs is their ability to manage extensive and intricate datasets, which makes them especially valuable for modeling Chl-a levels in water bodies. Another advantage is that these models can learn from historical data to forecast future Chl-a concentrations, which is useful for monitoring water quality over time and detecting potential problems before they become serious.

Integrating a hybrid DL model into water quality systems can significantly improve the management of water resources by providing accurate and timely information on Chl-a concentrations in water bodies. The occurrence of algal blooms resulting from high Chl-a levels can be mitigated by implementing an early warning system: the hybrid DL model can help water managers reduce the risks of harmful algal blooms, which can result in fish kills, unsafe drinking water, and even human and animal illness or death. The model can also function as a decision-support tool for forecasting and managing water quality by indicating when and where Chl-a levels will be high. Finally, whereas traditional monitoring methods require frequent, costly sampling, the hybrid DL model allows for continuous monitoring and real-time data analysis, reducing the need for manual sampling and enhancing monitoring efficiency.

We acknowledge limitations in this study, most notably concerning the period of the dataset used. The data extend only up to 2013, which constrains the applicability of our findings: the model does not account for potential changes in the lake's ecosystem that may have occurred after 2013, a significant concern given the dynamic nature of ecosystems and their susceptibility to both natural and anthropogenic influences over time. Although this limitation stems from the current lack of more recent, publicly accessible data, it remains a constraint, and our study therefore serves as the most current analysis possible within these data limitations. On the methodological side, our approach does not employ a multivariate forecasting strategy, which means it is not informed by other potentially impactful variables such as climate data or additional water quality parameters. While this simplifies the model, potentially increasing its robustness given the age of the data, it is a limitation that future research could address. Nevertheless, the methodology and analyses presented here could be applied directly to newer datasets as they become available, thereby validating and extending the utility and relevance of our findings.

Conclusions

Monitoring and forecasting water quality in water bodies is crucial for managing water resources, as water quality has a substantial influence on environmental processes and on the welfare of both humans and animals. This study aimed to develop EMs, using single-objective (GA) and multi-objective (NSGA-II) optimization algorithms applied to DL models separately, for forecasting the Chl-a concentration in the SPL located in Greece. Data on Chl-a levels and other WQVs were collected using a sensor from June 1, 2012, to May 31, 2013. To forecast the Chl-a concentration, four different DL models (RNN, LSTM, GRU, and TCN) were developed, and their results were assessed and compared with each other. Subsequently, EMs based on the GA and NSGA-II algorithms were developed to improve upon the results of the individual models. The study revealed that the GRU model exhibited superior performance in comparison with the other individual DL models in Chl-a forecasting, owing to its simpler structure compared to the LSTM and its ability to address the shortcomings of RNNs. Additionally, the EMs, which are based on SOO and MOO algorithms, demonstrated superior results when compared to the individual DL models.

Based on the insights gained and the results achieved in this study, several promising avenues for future research in water management and water quality forecasting can be outlined:

1. Subsequent studies could delve into the application of decomposition tools before implementing DL models; among these tools, the continuous wavelet transform, the fast Fourier transform, and VMD, among others, hold potential (an illustrative decomposition sketch follows this list).

2. Future research could explore additional hybrid models such as RNN-LSTM, RNN-TCN, GRU-TCN, WT-GRU-TCN, LSTM-TCN-WT, and GRU-TCN-VMD, and comparisons with the present study could be insightful.

3. Understanding and modeling uncertainty emerges as a pivotal endeavor within the realm of DL and EMs. Addressing this facet can significantly enhance the reliability of forecasting and ensuing decisions. Within this context, various types of uncertainty, including aleatoric, epistemic, and model uncertainty, warrant thorough consideration.
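As an indication of how such a decomposition step could precede the DL models, the short sketch below uses the PyWavelets package to split a Chl-a series into one approximation and several detail components via a discrete wavelet decomposition; the wavelet choice, the decomposition level, and the placeholder series are assumptions made purely for illustration.

import numpy as np
import pywt  # PyWavelets

# Placeholder for the Chl-a record; in practice this would be the measured series
chla = np.random.default_rng(0).random(365)

# Multilevel discrete wavelet decomposition: one approximation band plus detail bands
coeffs = pywt.wavedec(chla, wavelet="db4", level=3)

# Reconstruct each band back to the original length so each can be modelled separately
components = []
for i in range(len(coeffs)):
    kept = [c if j == i else np.zeros_like(c) for j, c in enumerate(coeffs)]
    components.append(pywt.waverec(kept, wavelet="db4")[: len(chla)])

# A DL model would then be trained on each component, and the component forecasts
# summed to recover the Chl-a forecast
print(len(components), components[0].shape)

Analogous pipelines could be built around the continuous wavelet transform, the fast Fourier transform, or VMD.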

The primary contribution of this study lies in the formulation of an ensemble approach that harnesses the complementary strengths of distinct DL models, effectively addressing the complexities tied to forecasting dynamic water quality variables. Our approach demonstrates heightened predictive performance compared to individual models, highlighting the potency of synergistic model combinations. Additionally, factors such as the availability of the water quality data used for the training and testing phases, model assumptions, transferability, temporal and spatial variability, model interpretability, and computational resources can be considered limitations and implications of the research.