Introduction

Transitional water bodies, such as coastal lagoons, are situated at the interface between the continent and the sea. These are active areas delivering essential ecosystem services (Mooney et al. 2009; Newton et al. 2018), and they occupy approximately 13 % of the world’s coastline (Barnes 1980). In these semi-closed water bodies, the gradient from fresh to saline water creates a rich biodiversity (Basset et al. 2013). In recent decades, Mediterranean coastal lagoon ecosystems have been especially exposed to anthropic eutrophication, primarily due to urbanization (Zaldívar et al. 2008; Souchu et al. 2010). Indeed, limited exports to the open sea and long periods of residence time in these water bodies have resulted in the accumulation of nutrients, mainly from anthropogenic activities (de Jonge et al. 2002; Newton et al. 2014). This excess of nutrient inputs contributed to the eutrophication of the water columns (Cloern 2001; Souchu et al. 2010).

Several studies have been conducted in coastal Mediterranean lagoons to assess the eutrophication level. García-Ayllón (2017) stated that the Mar Menor lagoon, located in the East of the Region of Murcia in Spain, has suffered from an important process of intense anthropization over the last five decades. One of the main indicators was the rapid population growth of new jellyfish species, reaching more than 100 million specimens each summer (Robledano et al. 2011). Thau Lagoon is another particularly interesting case of Mediterranean coastal lagoon eutrophication. Thau lagoon is an ecosystem located at the Mediterranean French coast, known by supporting traditional shellfish farming activities in France and which has been subject to eutrophication leading to large anoxic events associated with significant mortality of shellfish stocks (Derolez et al. 2020). In this context, the North Lagoon of Tunis, a South Mediterranean lagoon, located in the north of Tunisia, provides a good example for the diagnosis and study of eutrophication in Mediterranean coastal ecosystems. In fact, the North Lagoon of Tunis is one of the most important restored lagoons in Tunisia, which has known a critical ecological state substantially due to urban development (Harbridge et al. 1976).

Chlorophyll-a (Chl-a) is the most essential pigment in aerobic photosynthetic organisms and its measurement is used as an indicator of the phytoplankton biomass present in water and thus, the degree of eutrophication of the ecosystem (Tian et al. 2017). The probable presence of algae blooms that have a major effect on the physical, chemical and biological processes of the lagoon can be interpreted as elevated Chl-a levels (Tian et al. 2017).

Cyanotoxins generated by cyanobacteria in lagoon water could present a risk to human health (Watzin et al. 2006; Mc Quaid et al. 2011; Kalaji et al. 2016). In situations where the current cyanotoxin concentration is not available, Chl-a is also widely recognized as a surrogate indicator of cyanobacterial density (Wheeler et al. 2012). It is therefore important to monitor Chl-a concentration and, in turn, to provide information on the management of water quality.

The ability to automatically monitor water quality is very useful, especially in vulnerable areas where (1) there is a high threat of potential pollution events and (2) related socio-economic activities that involve preventive actions are carried out (Jimeno-Sáez et al. 2020). However, as far as we know, there is no automated system that reliably measures the concentration of Chl-a in real time (Jimeno-Sáez et al. 2020). Measurements of Chl-a must be performed in laboratories, which implies high latency and high cost (Jimeno-Sáez et al. 2020). To avoid such inconveniences, most ecologists have recently been using modeling techniques.

In terms of model building, there are typically two approaches in ecological modeling: (1) physically-based (or conceptual) and (2) data-driven-based models (Babovic et al. 2005; He et al. 2014; Zhang et al. 2016). Physical models are complex and require complex mathematical methods, sufficient physical data and high expertise and experience in their implementation (Aqil et al. 2007; He et al. 2014). While data-driven models are easier to implement, not so complex, and remove the need for specialized knowledge of physical processes controlling the transport of pollutants (Bowden et al. 2006), which make them popular and widely used for modeling complex natural processes, mainly in predictive modeling. They can be helpful and valuable for modeling and predicting eutrophication episodes in any natural ecosystem (Nayak et al. 2005; He et al. 2014).

The variables that influence Chl-a content in water bodies are numerous and complex (Jimeno-Sáez et al. 2020). In the existing literature, different statistical approaches have been used to predict Chl-a concentrations based on regression analyses (Su et al. 2015). However, these traditional methods of data processing usually use a linear relationship to simplify complex problems, resulting in unsatisfactory results because they are not sufficiently efficient to deal with complex non-linear relationships between the variables involved (Su et al. 2015).

Machine learning (ML) algorithms have been shown to be more efficient than conventional data processing methods in assessing water quality (Abba et al. 2017), as they are well suited for predicting non-linear and complex functions. Previous studies have confirmed the superiority of ML over traditional approaches in modeling water quality parameters (Charulatha et al. 2017). Among ML techniques, we can mention Artificial Neural Networks (ANNs). ANNs imitate learning processes of a human brain through training and calibration of the network. This skill makes ANNs useful tools for analyzing complex situations that are difficult to explain with traditional methods (Daliakopoulos et al. 2005; Samarasinghe 2007).

The ability to capture system dynamics and non-linearities makes ANNs especially appropriate for the investigation of natural systems, which usually have distinctive spatial-temporal variability (ASCE 2000).

ANNs algorithms have also been used to study Chl-a dynamics, since it is one of the variables representing algal biomass, and has been considered one of the early warning preventive approaches to avoid the occurrence of potential algal blooms (Chen et al. 2015). In a recent study, Li et al. (2017), applied different types of ANNs to estimate the concentration of Chl-a in 27 lakes in China, respectively. In recent study, Tian et al. (2017) used an ANN to predict Chl-a concentrations to an estuary reservoir in East China. Considering that the ANNs achieved the best result for different water quality parameters in several studies (Abba et al. 2017; Keller et al. 2018), including Chl-a, it can be expected that these models will obtain satisfactory Chl-a estimation results in this study.

There are also many classifications in ANNs, such as the Back Propagation Networks, Radial Basis Function Networks, etc. The Back Propagation network is the most commonly used learning algorithm (Rajaee et al. 2019). In this study, a recent ANN approach namely, a nonlinear autoregressive with exogenous inputs (NARX) neural network was developed. NARX is neural network that belongs to the non-linear back propagation dynamic neural networks (Hayken 1999).

In terms of time and cost, reducing the number of variables to be measured is very important. For this reason, it is of a great interest to select specific variables that are most related to Chl-a levels. ML provides an efficient technique to do that. It is called the Random Forest model (RF).

RF has been applied in many studies. Indeed, in 2016, Béjaoui et al. have investigated with the RF model the most important predictor variables for Chl-a variations in the lagoon of Bizerte, located in the north of Tunisia. In another more recent research study, Béjaoui et al. (2018) have used the RF model to study the dynamic of the plankton in Ghar Melh lagoon, located in the north of the Tunisian Mediterranean coast.

It is known that an early-warning proactive approach of the Ch-a content is essential to prevent or mitigate the occurrence of eutrophication episodes, especially in sensitive ecosystems (Jimeno-Sáez et al. 2020) like the North Lagoon of Tunis. For this reason, the use of NARX is essential to perform a one-step ahead forecasting of Chl-a values in our study of the North Lagoon of Tunis.

Given the superiority of the ML algorithms, this study has been conducted to achieve the following objectives: (1) to select the specific variables that are the most related to Chl-a concentrations in the North Lagoon of Tunis, using, especially, RF model. To do so, different variables combinations were tested, (2) to develop an ANN network to estimate and forecast one step ahead of Chl-a concentrations based on NARX neural network and to (3) validate the performance of the model.

Several studies have been carried out on the eutrophication process and water quality parameters of the North Lagoon of Tunis (Ben Charrada 1992; Rezgui et al. 2008; Trabelsi-Bahri et al. 2013). However, to the very best of our knowledge, there is no previous research using ML models to predict water quality parameters in this lagoon, specifically Chl-a concentrations.

Materials and Methods

Study Area

The North Lagoon of Tunis is a well-mixed shallow coastal seawater lagoon located in north of Tunisia (36°45′–36°52′ N and 10°10′–10°20′ E) and Southern Mediterranean Sea (Fig. 1). Covering about 22 Km2 with an average depth of 2 m (range between 0.5 and 3.5 m). This lagoon is one of Tunisia’s main shallow water bodies (Trabelsi-Bahri et al. 2013). It is connected to the open sea at the Gulf of Tunis, where water is exchanged with the Mediterranean Sea through Kheireddine channel, which measures 800 m in length and 40 m in width and has a mean depth of approximately 2.5 m (Ben Charrada 1992).

Fig. 1
figure 1

Location of the study area and water quality monitoring stations (1–5). Arrows inside the map represent the unidirectional inlet/outlet water circulation system

In 1985, a major restoration project has been undertaken in this lagoon to stop pollution and eutrophication (Van Berk and Oostinga 1992). The ultimate goal of this project was to achieve a good chemical and ecological status in this eutrophicated lagoon and to achieve extensive land reclamation around it (Trabelsi-Bahri et al. 2013). The goal was also to reduce the water retention time in the lagoon (Trabelsi-Bahri et al. 2013). These aims have been successfully accomplished by creating an inlet/outlet tide-driven circulation system following the construction of a longitudinal (East-West) separation dam across the lagoon and inlet/outlet gates at the entrance of the canal connecting the lagoon to the open sea (Trabelsi-Bahri et al. 2013). The gates and the separating dam permitted a good circulation (Fig. 1) of the lagoon’s water (Van Berk and Oostinga 1992). In addition, the shoreline was straightened to prevent water stagnation (Trabelsi-Bahri et al. 2013).

The restoration project resulted in a clear improvement of the biodiversity (Trabelsi-Bahri et al. 2013). As a matter of fact, in January of 2013 the North Lagoon of Tunis was deemed a “wetland of international importance”, Ramsar site (Mdaini et al. 2019). However, being aware of the importance and fragility of this ecosystem, it must always remain under observation.

Data Analysis

The monthly concentrations of Chl-a data along with physicochemical parameters of water quality of the North Lagoon of Tunis for the period from January 1989 to April 2018 were collected.

In the present study, a set of seven environmental variables known to affect Chl-a concentrations were monitored: Secchi depth, dissolved oxygen, total phosphorus, total nitrogen, pH, water temperature and practical salinity (Sp; IOC et al. 2010) called salinity in this paper. Sampling campaigns have been carried out from February 2014 through April 2018 at five sampling stations (Fig. 1). Water samples were collected about 10–20 cm below the water surface according to standard methods. In addition, Al-Buhaira Invest company, which is in charge of the ecosystem, provided us with long monthly time series, as a part of the monitoring program for the lagoon, in order to gather information on the physical and chemical characteristics of the ecosystem.

All ML techniques in addition to linear model building were performed using the MATLAB software MATLAB® software (version 9.3.0.948333 (R2017b), The Mathworks, MA, USA).

Chl-a Concentrations

As above-mentioned, Chl-a concentrations were used as the eutrophication indicator, thus to assess the ecological status of the studied area. Water samples of 500 mL for Chl-a measurements were filtered through a 0.45 μm pore-size membrane (Millipore). Chl-a was extracted in 10 mL of 90 % acetone for 24 h in the dark at -20 °C following the protocols published by Parsons et al. (1984). The extract concentration was analyzed spectrophotometrically according to the method of Lorenzen (1967).

Figure 2 illustrates that the temporal variability of the Chl-a in the North Lagoon of Tunis, is relatively low and approximately similar for stations 1, 2, 3, and 4, and is significantly higher at station 5, which is the furthest from the sea water inlet gate.

Fig. 2
figure 2

Temporal variability of Chl-a in the North Lagoon of Tunis at: (a) station 1; (b) station2; (c) station 3; (d) station 4; (e) station 5; (f) mean concentration of the five sampling stations

Physico-chemical Variables

Physico-chemical variables, including water temperature and salinity were measured in situ using a WTW LF325 conductivity meter. pH was measured by a pH 330i WTW pH meter.

The Secchi depth of the lagoon was measured with a 25 cm diameter Secchi disc. the Secchi depth is the visibility of the Secchi disc. In fact, the Secchi depth is a parameter indicator of the transparency of the water column and it is the depth of disappearance of the Secchi disc.

Since the end of the restoration works, the visibility of the lake bottoms has improved significantly. In the absence of strong wind the rapport (transparency / depth) generally exceeds 90 % (Shili 1995).

Dissolved oxygen was measured by OXY 197 oxymeter. Figure 3 shows that the temporal (seasonal and interannual) variations of dissolved oxygen is relatively similar at all five stations in the lagoon. This figure further illustrates that the lagoon remains well oxygenated from the inlet to the outlet gates. Thus, suggesting that the dissolved oxygen air-sea exchanges are efficient and that the lagoon ecological state remains healthy.

Fig. 3
figure 3

Temporal variability of the dissolved oxygen in the North Lagoon of Tunis at: (a) station 1; (b) station 2; (c) station 3; (d) station 4; (e) station 5; (f) mean concentration the five sampling stations

Surface water samples were analyzed in the laboratory for total phosphorus and total nitrogen. Samples for nutrient determination were collected in 1000 mL acid-washed polypropylene bottles and kept on ice until use.

Nutrient analyses were performed using a spectrophotometric method (Strickland and Parsons 1972) with a UV-visible spectrophotometer. Total phosphorus and total nitrogen were determined after alkaline peroxodisulfate digestion in an autoclave using unfiltered water. For the determination of total phosphorus, the phosphorus compounds were mineralized to orthophosphate ions in an autoclave at 100 °C using a solution of sulfuric acid and potassium persulfate.

Determination of total nitrogen compounds required high oxidation of the nitrogenous ions into nitrates in an autoclave using an alkaline solution of persulfate, then by the reduction of nitrates to nitrites by passing through a cadmium column. The nitrites formed were determined using sulfanilamide and N-naphthyl-ethylene.

Modeling Approaches

Analysis of Variance (ANOVA)

Analysis of variance ANOVA was performed to ascertain if there were any significant difference in physico-chemical conditions and in Chl-a concentrations among the sampling stations in the North Lagoon of Tunis. ANOVA was also conducted to verify any significant relationship between Chl-a and the physico-chemical parameters.

Random Forest

ML algorithms are typically implemented with a set of predictor variables (input variables) and one or more target variables (output variables) represented as a continuous value for regression problems (Kohavi and Jhon 1997) where maximizing accuracy is the main goal of predictive modeling (Motoda and Liu 2002). To estimate a parameter of water quality we can use all available predictor variables, or select a smaller number of them. This can result in the inclusion of too few or too many inputs to the model, both of which are undesirable (Maier et al. 2010). To address this issue, a predictor variable selection stage had been considered in this study to eliminate redundant data. The aim of reducing the number of predictor variables in ML is to speed up the process of the learning algorithm to boost predictive accuracy and increase the comprehensibility of learning results (Motoda and Liu 2002). There are many parameters that influence the concentration of Chl-a. This study performed a RF model to determine the appropriate predictor variables that are the most important to the Chl-a concentration. RF modeling is a relatively recent ML approach based on decision trees and trained on a collection of input predictor variables in order to obtain an accurate prediction of the output variable (Breiman 2001).

RF approach has many advantages. First, no probability distribution of predictor variables is assumed. Second, it can handle a large number of variables, selecting the most useful ones among them (Mulia et al. 2013; Park et al. 2015). Third, RF predictions are exceptionally accurate, since they come from the average set of many simple models, thereby avoiding the over-fitting problem typical of many non-linear regression techniques (Phillips et al. 2008; Huang et al. 2015). Fourth, Because each tree is built on a random subset of the original data, no separate independent dataset or cross-validation approach is required to test the predictive performance of the model (Were et al. 2015). Finally, RF technique is adequate for natural ecosystems where there is a large amount of physico-chemical and biological variables that have complex relationships.

An interesting aspect of the RF model is that it can provide a quantitative measure of the importance of the various predictor variables in the final result, which can be useful in choosing the most important ones.

The method used to evaluate the ranking of the most important predictor variables from the RF model is the out-of-bag (OOB) technique by permutation; a technique that measures how influential the predictor variables in the model are at predicting the response variable (Chl-a). The effect of the predictor variable increases with the value of this measure.

If a predictor variable has an effect on the prediction, then the permutation of its values should have an impact on the model error. If a predictor variable is not influential, the permutation of its values should have little to no effect on the model error (Mitchell 2011). It consists of calculating the gain in the mean square error, which is computed by permuting OOB data: for each tree, the prediction error on the OOB portion of the data is recorded; the same is done by permuting each predictor variable (Mitchell 2011). The differences between the two OOB errors are then averaged over all trees and normalized by the standard deviation of the differences (Mitchell 2011).

The mean of squared residuals (MSE) and coefficient of correlation (R) were used to evaluate RF model performance. MSE is a quantitative measure of the error acquired by the model when a prediction for the target variable is made. It can be sensitive to outliers and is best used in conjunction with other metrics when outliers are present to evaluate a given model (Cutler et al. 2007). If the MSE is close to 0, it indicates a very close approximation to the actual values. The MSE is defined as:

$$MSE=\frac{1}{N}\sum _{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}$$
(1)

where:

\( {y}_i \)and \( \hat{y_i} \)denote the modeled concentration and the observed concentration of Chl-a, respectively and n is the number of data in each data set.

Prediction accuracy R represents the degree of correlation between the prediction values and the observed values, and a high R value (close to 1) means the prediction is close to the observed value (Xu et al. 2019).

$$R={\left(1-\frac{\sum _{i=1}^{n}{\left({y}_{i}-\widehat{{y}_{i}}\right)}^{2}}{\sum _{i=1}^{n}{\left({y}_{i}-\stackrel{-}{{y}_{i}}\right)}^{2}}\right)}^{\frac{1}{2}}$$
(2)

where:

\( \overline {y_i} \) denotes the average of the observed values of Chl-a.

It is to be mentioned that the coefficient of determination (R2) was calculated from R to contribute in the assessing and comparing the performance of the models. Thus, useful information can be obtained concerning the relative importance of all variables and their capability of forecasting Chl-a concentrations.

The RF model was simulated twice. First, considering only the seven physico-chemical predictors we had for predicting Chl-a concentrations. Second, we checked whether for an observable spatial or seasonal dependence among Chl-a predictors. In other words, we checked whether the predictions of Chl-a concentrations could be improved by considering the sampling stations and the seasons as observed variables. This was done by including two new categorical predictor variables representing the five stations and the four seasons, respectively. This is possible because RF models can handle both quantitative and qualitative predictor variables.

Nonlinear AutoRegressive with eXogenous Inputs (NARX) Neural Network

ANN is a massively parallel-distributed information-processing tool that aims to simulate the functioning of brain neurons via a network of artificial neurons organized into layers (Haykin 1999). The network receives a stimulus and converts this input into a signal output via a transfer function (Jimeno-Sáez et al. 2018). Because of its ability to assign meaning to input parameters and to map the inputs to the outputs, the ANN model is an effective modeling technique when the relationships between the variables of the underlying physical processes are complex or uncertain (Wei et al. 2001). These neural networks are a non-linear modeling tool that can manage a large number of inputs to determine one or more outputs (Fogelman et al. 2006).

There are many types of ANNs for different applications. Generally, in ANNs, the direction of information flow between nodes or neurons, from the input to the output layer, and each node in a layer is connected to each of the nodes in the next layer (ASCE 2000). The neurons are linked to other neurons via links that have an associated weight that reflects their strength of connection and stores network information (Jimeno-Sáez et al. 2018). The important feature of the transfer function is that as input values changes, it provides a smooth, distinguishable transition. In other words, a minor change in the data, produces a minor change in the output (Jimeno-Sáez et al. 2020). This transfer function is used within a certain proper range to constrain the outputs of neurons (Xu et al. 2019). The most common threshold transfer functions include a linear function, nonlinear gradient descent function, stepwise function, and S-shaped function (Xu et al. 2019). In NARX, the transfer function for the hidden layer is an S-shaped (e.g. sigmoid) function, and the transfer function for the output is a linear function (Xu et al. 2019).

Then, the propagation function computes the input to a neuron from the outputs of its predecessor neurons and their connections as a weighted sum (Rajaee et al. 2019). A bias term can be added to the result of the propagation (Rajaee et al. 2019).

Therefore, the connection weights, biases, transfer and propagation functions parameterize the mathematical relationship between inputs and outputs of the network (Nguyen et al. 2007). These weights and biases need to be adjusted in the training process of the networks to minimize the model error (Jimeno-Sáez et al. 2020).

NARX model is a dynamic recurrent neural network that encloses several layers with feedback connections (Hayken 1999). It has previously been applied by many researchers to model nonlinear processes. In 1996, Lin et al. stated that NARX network is a powerful modeling and validation tool with a much faster convergence that generalizes much better than other ANNs. There are many applications for the NARX network in representing nonlinear dynamic behaviors.

A NARX model is defined as follows:

$$y\left(t\right)=f\left(y\left(t-1\right), y\left(t-2\right),\dots ,y({t-n}_{y}\right),u\left(t-1\right),u\left(t-2\right),\dots ,u\left(t-2\right),\dots ,u\left({t-n}_{u}\right)$$
(3)

where: f is generally unknown and can be approximated, u(n) and y(n) denote the input and output of the model at discrete time step n\(n\), respectively.

The first step in a NARX model is to determine the input and output variables. In our study, the output is the Chl-a variable and the input variables are those having the highest permutation importance according to RF model. To ascertain that, different predictor variables combinations were tested. The next step is to set up the network configuration, which consists of determining the number of neurons in the hidden layer and the number of time delays in the input layer to maximize modeling ability. The prediction accuracy (weights and biases) can be improved by adjusting these two parameters (Xu et al. 2019). There is no default criterion to determine the best structure, therefore we evaluated the performance of the network with different structures, after a training phase mostly based on the size of errors, such as checking the MSE and R, as in the RF model. In addition, the error autocorrelation function and input-error cross-autocorrelation function were also checked to evaluate the NARX performance. The autocorrelation error function describes how prediction errors are related in time. For a perfect model of prediction, the difference between the two errors should be small enough to be statistically insignificant. This would mean that the prediction errors are entirely uncorrelated with each other. This also means that the values of error autocorrelation should mostly be within a certain confidence interval of 95 % (Xu et al. 2019).

The input-error cross-correlation function describes how the errors are correlated with the input sequence. For the ideal prediction model, all correlations should be zero, except for the one at zero lag (Xu et al. 2019).

Three training algorithms, which are the fastest and most commonly adopted in NARX training were tested: Levenberg-Marquardt algorithm, the backpropagation algorithm and the Bayesian regularization algorithm.

Multiple scenarios with different predictor variables combinations (inputs) were tested to simulate the NARX model, and the one with the best performance were used to develop the network.

Results and Discussion

Data Properties

ANOVA revealed no significant difference among the sampling stations in the lagoon for any of the physico-chemical variables or Chl-a concentrations, which is in general good agreement with the observed data sets (except for Chl-a at only station 5; Figs. 2 and 3). Therefore, all data were grouped by months to reconstruct the monthly dynamics (mean of the five stations) of each variable in the North Lagoon of Tunis (such as in Figs. 2f and 3f).

In general, temperature, salinity and dissolved oxygen measurements show typical trends of the southern Mediterranean coastal marine climate (De Casabianca et al. 1991; Armi et al. 2012; Dhib et al. 2016). The temperature values indicate an apparent seasonal pattern. The minimum and maximum temperatures recorded in winter (January) and summer (July and August) varies between 8 and 30 °C, with an annual average of 19 °C. salinity also shows a seasonal trend with a minimum of 32.75 in July and a maximum of 42 in November, a mean of about 36.5 in the lagoon.

The concentrations of dissolved oxygen display a strong seasonal pattern, ranging from 5 mg/l in summer (July) to 11.5 mg/l in winter (December) in the lagoon. Concerning the eutrophication indicator (expressed as Chl-a), its trend shows an increase from winter (January) to summer (June) reaching its maximum monthly concentration of about 3.65 µg/l in the lagoon. The total phosphorus concentrations ranged from 4 µg/l to 65 µg/l in the North lagoon of Tunis. Total nitrogen monthly concentrations in the lagoon ranges between 200 and 1100 µg/l. Nutrient concentrations (total phosphorus and total nitrogen) exhibit a common seasonal variation, being relatively high in autumn and winter and low in summer. The monthly pH values ranged from 7.2 to 8.7 during our study period, in the North Lagoon of Tunis. A pH range of 6.5 to 8.5 is acceptable for aquatic biota according to the APHA (1999). So we can state that the North Lagoon of Tunis is in a good agreement with these limits.

Random Forest

Chl-a is one of the most relevant markers of water bodies’ presence and degree of eutrophication (Lu et al. 2016). In the North Lagoon of Tunis, Chl-a monthly mean concentrations range from a minimum of 0.22 µg/L and maximum of 3.65 µg/L. Chl-a levels can have complex relationships with both nutrient components (total phosphorus, total nitrogen) and water quality variables (salinity, pH, temperature, dissolved oxygen and Secchi depth) in coastal ecosystems (Jimeno-Sáez et al. 2020). The RF is a suitable technique from ML algorithms, when it comes to dealing with complex relations between variables. The RF model was trained on the North Lagoon of Tunis data-352 samples of 7 predictor variables (Secchi depth, dissolved oxygen, temperature, salinity, total nitrogen, total phosphorus and pH) and one target variable, the Chl-a. The implementation gave an R2 measure of about 0.62 and MSE equal to 0.28. Figure 4 shows the ranking of predictor variables according to their importance by OOB technique by permutation. Only a few descriptors contributed noticeably to the estimation of the Chl-a content namely, Secchi depth followed by the dissolved oxygen and pH.

Fig. 4
figure 4

Predictors importance ranking for the “first” RF model to predict Chl-a content in the North Lagoon of Tunis. The importance of each predictor is measured using the OOB technique by permutation due to each predictor

The most important pigment in aerobic photosynthetic organisms is Chl-a. The depth will impact the strength of sunlight in water and thus the photosynthesis of most algae (Frolov et al. 2012), explaining the strong correlation between the depth and Chl-a.

Algae produce oxygen during the day and absorb it during the night. Oxygen absorption also occurs during the process of algae death and decay (Béjaoui et al. 2016). In agreement with that, our findings have shown that dissolved oxygen is also associated with Chl-a concentrations.

In addition, several studies have demonstrated the strong correlation between Chl-a and pH (Menendez et al. 2001; Zang et al. 2010; Wallace et al. 2016).

In decreasing order of importance, the other predictor variables included in the RF model were: total phosphorus, total nitrogen, salinity and temperature.

A direct comparison (scatter plot) of the observed and predicted Chl-a concentrations is shown in Fig. 5. The fitted RF model was much better than the one reported by Béjaoui et al. (2016) for Bizerte lagoon (R2 = 0.51), and similar to the one reported by Béjaoui et al. (2018) for Ghar el Melh lagoon (R2 = 0.64). Both lagoons are located in the Mediterranean coast of north Tunisia. Hence, for the North Lagoon of Tunis, the observed Chl-a concentrations were more accurately predicted than those of Bizerte lagoon. It is known that for predictive modeling, the number of the observed data is very important for the accuracy of the model (Béjaoui et al. 2016). We used long-term of monthly observations that lasted approximately three decades in the North Lagoon of Tunis, which makes the results of the RF accurate in the studied ecosystem.

Fig. 5
figure 5

RF Prediction of the Chl-a concentrations using the physicochemical predictor variables data in the North Lagoon of Tunis. Predicted response is Predicted Chl-a values and True response is Observed Ch-a values

For comparison, a Multivariate Linear Regression (MVLR) model was fitted in addition to the RF model. The linear model parameters (Estimate) between predictor variables and Chl-a concentrations were almost consistent with the relationships found with the above RF model (Table 1). Both Secchi depth and dissolved oxygen were the two most important predictors explaining Chl-a concentration levels. Thus, the linear model quantitatively confirmed the outcomes of the RF model.

Table 1 Regression coefficients between Chl-a concentrations and physico-chemical variables in the North lagoon of Tunis using linear model (coefficients marked with (*) are statistically significant at p-value < 0.05)

MVLR gave an R2 of about 0.29. It’s obvious that the RF model captures more efficiently the dependency of Chl-a concentrations on other variables than the MVLR. The quality of the results is ensured by using the OOB procedure by permutation. We can thus confirm that the RF model could be used to better understand more complex dependencies among variables since it has several advantages over traditional correlative analyses (i.e. a decrease in outlier sensitivity, no implicit assumptions on data distribution).

According to the MVLR, the Chl-a concentrations had a significant correlation with water quality variables, as secchi depth, dissolved oxygen followed by total phosphorus in the study area. However, the weakness of its performance suggests that the use of traditional regression methods in the modeling of such a complex process is meaningless, so there is a great need to use more effective techniques (Mjalli et al. 2006).

This fact may support the conclusion drawn by Maier et al. (2010) that using a linear approach to define which of the potential input variables have a significant relationship with the model output is not sufficient for the development of ANN models.

All variables, were transformed, to normalize their distribution prior to any modeling analyses. However, the transformations did not improve the performance of the MVLR.

The relationship between all variables is strictly nonlinear, which is expected. Natural ecosystems are governed by a number of complex processes due to the impacts of hydro-climatic variables such as evaporation, temperature, precipitation, etc. and anthropogenic contribution (Schramm 1999; Viaroli et al. 2008).

In recent studies of similar lagoons located in the north of Tunisia, Chl-a concentrations were found linearly not related to the physico-chemical parameters (Béjaoui et al. 2016, 2018).

We performed all our modeling directly on the original data using ML techniques, known for their abilities to deal with non-linear complex time series processes. Fitting a model directly without transformation is advantageous for forecasting, because forecasts are returned on the original scale.

A second RF model was also fitted by adding two new categorical variables: station (for an observable spatial dependence) and season (for an observable seasonal dependence). The performance of the RF slightly decreases and attains an R2 of about 0.59.

The addition of a new predictor variable containing spatial information (station) appears to have little importance on the model simulating Chl-a concentrations (Fig. 6). The existence of a strong correlation between Chl-a concentrations and secchi depth showed that depth data itself might be sufficient for demonstrating the spatial variations of Chl-a concentrations in the lagoon without including the categorical variable (station). Moreover, for the RF model, the categorical variable (season) did not really had an important effect on Chl-a concentrations. This finding was expected since the variable (temperature) can interfere with the variable (season). Additionally, we may state that this can be due to the climate of Tunis, which stays relatively warm all along the year, including winter.

Fig. 6
figure 6

Predictor importance ranking for the “second” RF model to predict Chl-a content in the North Lagoon of Tunis. The importance of each predictor is measured using the OOB technique by permutation due to each predictor

RF is a good predictive technique to study the correlations between physico-chemical and/or biological variables in coastal ecosystems. Béjaoui et al. (2016), showed that mainly dissolved inorganic nitrogen (NO3) along with dissolved oxygen are the greatest contributors to Chl-a content in Bizerte lagoon. Furthermore, in 2018, Bejaoui et al. reported that temperature and silicates are the two most strongly correlated variables to the plankton dynamics in Ghar Melh lagoon. Although, the influence of the predictor variables of Chl-a were different in several research works, the dissolved oxygen and secchi depth generally were among the main variables.

For example, Palani et al. (2008) applied the ANN model with location variables, orthophosphates (PO4) dissolved oxygen and temperature as the explanatory variables to predict Chl-a concentration. Li et al. (2017) selected the concentration of total phosphorus and total nitrogen, temperature, secchi depth, and dissolved oxygen among the most influential input variables for Chl-a, using a genetic algorithm optimized back-propagation neural network. Furthermore, Kuo et al. (2007) defined the Chl-a model by the input of month, temperature, pH, secchi depth, suspended solids (SS), PO4 and NO3.

It is important to highlight that the difference in the RF results between the previous ecosystems, with the North Lagoon of Tunis can be explained by the ecosystem specificities as dimensions of water masses, different eutrophic states, water depth and communication with the sea. In addition, various modeling approaches, in addition to different field works and laboratory analysis techniques may have contributed to these differences.

Nonlinear AutoRegressive with eXogenous inputs (NARX) Neural Network

In this study, four scenarios with different input combinations of the predictor variables are tested for estimating and forecasting Chl-a concentration values in the North Lagoon of Tunis using the NARX network.

The first input scenario (S1) considered all parameters we had as inputs without selection. The second scenario (S2) included only the three most important predictor variables according to the RF model. The third input scenario (S3) included only the most highly correlated parameters according to the MVLR. the last scenario (S4) included only the two most important predictor variables according to the RF model. Summarized results of predictor variables selection is shown in Table 2.

Table 2 Summarized results of predictor variables selection

The NARX models with the four input scenarios described in Table 2 were developed to simulate the Chl-a concentrations. The four versions of each model represent four substantially different Chl-a models, due to the different combinations of variables used as predictors. Different ML models are compared based on statistical indices (Jimeno-Saez et al. 2020) such as R, R2, MSE, etc. These performance measurements are summarized in Table 3 for the NARX network. Different parameters are tried for each NARX model and the best one; with the minimum MSE and maximum R and R2 in selected for the forecast of Chl-a task.

Table 3 Performance of Chl-a estimation from NARX models based on four different input scenarios

The topology with 10 neurons in one hidden layer and 2 lags in the input variables provided the best performance in the prediction of Chl-a concentrations among all the scenarios. Our study considered that a proportion of 70 % training, 15 % validation and 15 % test is a favorable implementation.

The Levenberg–Marquardt algorithm, an extensively recognized training algorithm, was used for minimizing nonlinear functions. Training automatically stops when generalization stops improving, as indicated by an increase in the mean square error of the validation samples (Xu et al. 2019).

The comparative results between the four versions of the NARX model reveal that the NARX with two inputs selected by the RF algorithm yielded the best accuracy among all the developed NARX models in term of higher R and R2 and lower MSE values (R = 0.79; R2 = 0.62; MSE = 0.31).

With three inputs, selected also according to the RF model, the S2 scenario is the second most accurate model with a performance close to the best one. Because simulating time and over-fitting risks increase with the number of predictor variables in predictive modeling, a good practice is to create a model using as few predictor variables as possible (Jimeno-Saez at al. 2020).

We now present one step-ahead (a month) forecasting results for the eutrophication indicator considering the three datasets: Chl-a concentrations as the target, using secchi depth and dissolved oxygen as external inputs.

Most of the NARX model errors were very close to zero and fall within the confidence interval (Fig. 7), therefore the autocorrelation errors were negligible. In general, the input-error cross-correlation plot (Fig. 8) showed that all the correlations fell within the confidence interval.

Fig. 7
figure 7

The autocorrelation error plot

Fig. 8
figure 8

The input-error cross-correlation plot

Furthermore, the obtained results (Fig. 9) show an overall correlation of R = 0.79 between the actual data (targets) and the predicted values (the outputs). The Error histogram was checked. It presents a closely bell-shaped normal distribution of the errors (Fig. 10). Given that, we can conclude that, residuals of the NARX model are uncorrelated and normally distributed.

Fig. 9
figure 9

Correlation between original (target) and predicted (output) Chl-a values obtained with the NARX network

Fig. 10
figure 10

Error histogram of the NARX network

The model’s fitness is described in Fig. 11 and a visual comparison of Chl-a concentrations predictions with respect to the observed data is shown. There is a fairly good match between the observed values and the fitted values. The NARX network was able to predict the high variability of Chl-a concentrations, therefore, the fitted model seems to be mathematically accurate and the NARX could be used on a new data set. Given its effectiveness, multiple studies used neural networks techniques to model the Chl-a contents as an eutrophication indicator in coastal ecosystems. We can mention the study of Nazeer et al. (2017), who suggested, using ML methods, such as ANN for a more accurate and efficient routine monitoring of coastal water quality parameters, particularly Chl-a, in a coastal area of Hong Kong. In another study carried out in the Mar Menor lagoon in Spain, a Multilayer Neural Networks have been used for the eutrophication modeling, considering Chl-a as the eutrophication indicator (Jimeno-Sáez et al. 2020). In 2003, Lee et al. used back-propagation learning algorithm for training the ANN to predict the algal bloom dynamics of the coastal waters of Hong Kong using a 4-year set of phytoplankton abundance data. Lu et al. (2016), used a back-propagation ANN model for the prediction of Chl-a concentrations in lake Champlain in China.

Fig. 11
figure 11

Measured and simulated Chl-a concentrations using NARX network in the North Lagoon of Tunis

By comparing the NARX results (R = 0.79; R2 = 0.62) with the MVLR including just the two most important predictors for the Chl-a concentrations, the performance of the final linear model decreases further and attains an R2 of about only 0.2. We can thus confirm that the relationship between the variable predictors and Chl-a concentrations is obviously non-linear and the use of the NARX, RF and generally ML techniques, is adequate for forecasting Chl-a contents in the studied lagoon.

The forecasting of the Chl-a content one month ahead gave a value about 0.51 µg/L, which was close to the observed value (0.5 µg/L). These values are very close, which indicates a relative normal Chl-a level of monthly variation in the lagoon and the accuracy of the developed NARX model.

Conclusion

The ecological stability of the North Lagoon of Tunis makes it of a significant socio-economic and ecological values. Multiples services are provided in this ecosystem, such as in tourism (water sports), in fisheries, and in the conservation of sea birds. Thus, it is necessary to improve our understanding of the eutrophication process and of the interactions among the water quality parameters in the lagoon, to adopt sustainable management strategies. One of the most important indicator of the presence and degree of eutrophication in water bodies is the Chl-a content.

The approach proposed in the current study relies on a combination of ML methods, using NARX neural network and RF model to predict and forecast Chl-a concentration in the North Lagoon of Tunis. Mainly secchi depth along with dissolved oxygen are the greatest contributors to this eutrophication assessment and forecasting. Our results agree well with findings from other studies carried out on Mediterranean coastal lagoons. It’s worth mentioning that, secchi depth and dissolved oxygen are very practical variables to measure, without the need of extra laboratory analysis.

The NARX developed was able to predict Chl-a concentration dynamics fairly well using minimal input predictor variables. Our results show that complex behavior in the eutrophication process could be modeled using the NARX technique and, some extreme values were successfully estimated.

The results confirmed the significance and usefulness of intelligent modeling as a tool that is simple, rapid, easy to operate and not costly. The developed model can be used to (1) estimate Chl-a concentrations when the real value is not available and to (2) simulate different water quality scenarios for extreme ranges of input and output parameters.

It is important to mention that despite the important amount of the observed data (approximately three decades) used for developing the NARX, it has a very short computational time.

In the wider context of the study of coastal lagoons and other transitional ecosystems, our approach could be used to assess and predict the eutrophication process of these natural environments and help in decision-making by civil authorities, as well as by engineers, economists, investors and other interested stakeholders.

As perspectives, in order to improve the accuracy of the model we would suggest, adding more data either by simulation (in interpolating the available data), or ideally by performing daily or weekly measurements, at least for the most important parameters (Chl-a, dissolved oxygen, transparency, nutrients), maybe not at all the stations but at a minimum at the two stations 2 (in the north zone of the lagoon) and 5 (in the south zone of the lagoon).