Introduction

Harmful cyanobacterial blooms (HCBs) are among the most common problems in lakes and reservoirs and are a consequence of the excessive release of nutrients and pollutants from anthropogenic activities (Paerl 2017). HCBs have been recognized as an emerging issue, causing a broad range of environmental, social, and economic damage (Lee et al. 2015; Schaeffer et al. 2018). HCBs, for instance, may lead to hypoxic or anoxic conditions that cause mortality (Chorus and Bartram 1999). Surface blooms degrade water quality and have negative effects on recreational opportunities and the economy (Paerl and Huisman 2008). HCBs are also well known for their toxic secondary metabolites, known as cyanotoxins, which include hepatotoxins, neurotoxins, and dermatotoxins. These toxins have detrimental effects on higher trophic levels, causing mortality and illness in aquatic animals as well as adverse health risks to humans (Pham and Utsumi 2018). The impacts of HCBs on human life have been exacerbated by eutrophication and global warming (Paerl and Paul 2012; Visser et al. 2016).

It is, therefore, crucial to monitor and understand the spatial and temporal variations of chlorophyll-a (Chl-a) concentration, used here as a proxy for HCBs, in raw water storage units. Recently, remote sensing has been strongly recommended as a practical approach not only for long-term Chl-a observation but also for water quality analysis, mainly because of its capability to capture synoptic data over a large area during an algal bloom (Bresciani et al. 2018). This approach contrasts with traditional field-based methods, which are usually costly and labor intensive and provide a low frequency of in situ measurements (Mu et al. 2019; Quang et al. 2017; Schaeffer et al. 2018). An accurate remote estimation of Chl-a concentration in turbid productive waters is essential for large-scale and multi-temporal studies. However, the lack of appropriate satellite sensors and Chl-a retrieval models has left researchers with unresolved challenges (Le et al. 2009; Mishra and Mishra 2012; Toming et al. 2016).

Numerous studies have used satellite data to monitor the occurrence of algal blooms in coastal and inland waters, most of which follow models based on the correlation between the inherent optical properties (or apparent optical properties) and the water quality parameters (Bresciani et al. 2018; Lins et al. 2017; Quang et al. 2017; Zhang et al. 2016). It has been recognized that, as a result of the complex interaction between the inner and outer constituents, the variation of Chl-a concentration in water usually results in a nonlinear relationship between phytoplankton abundance and a group of water quality, hydrology, and meteorology factors (Lou et al. 2016; Yi et al. 2018a). Moreover, due to the presence of multiple constituents such as detritus, non-algal particles (NAPs), and colored dissolved organic matter (CDOM), the use of remote sensing for monitoring Chl-a in inland waters has been far less successful than its application in open oceans (Chen et al. 2017; Li et al. 2018; Liu and Tang 2012). To overcome such limitations, locally tuned satellite band ratios have been preferred and, most recently, advanced machine learning methods have contributed various practical models for Chl-a retrieval in lakes and reservoirs.

In general, the input data of machine learning models consist of either remote sensing reflectance (as reviewed above) or water quality parameters. The latter approach has an apparent advantage: water sampling and laboratory analysis provide reliable measurements, which in turn supports the accuracy of the input data for the model. Published studies in this group include artificial neural networks (ANN) with back propagation and/or support vector machine regression (SVR) (Chen et al. 2017; Kown et al. 2018; Park et al. 2015; Wang et al. 2018; Xie et al. 2012) and principal component analysis with multivariate linear regression (Keller et al. 2018); more recently, the extreme learning machine has contributed considerably to research in this field (Lou et al. 2016; Yi et al. 2018a). In most cases, SVR is preferred, mainly because of its proven advantages and acceptable accuracy in both the training and test phases (Karamizadeh et al. 2014). To reduce the amount of input data, Li et al. (2018) applied minimum redundancy/maximum relevance (mRMR) and random forest to select the key factors for random forest and support vector machine models.

In Vietnam, HCBs occur consistently and at a high frequency in both inland and coastal waters. However, the prediction of HCBs has only been examined using physical and band ratio-based models applied to satellite data (Dippner et al. 2011; Ha et al. 2017a; Ha et al. 2017b; Liu and Tang 2012; Tang et al. 2004). Machine learning methods have seldom been used for this task, despite their proven performance (Blix et al. 2017; Blix and Eltoft 2018b; Bui et al. 2017; Keller et al. 2018). In addition, whether a linear model is adequate for Chl-a prediction remains an open question in waters with complex optical properties. Hence, there is a notable lack of knowledge about bio-optical variability in freshwater systems, which machine learning could help to address.

Using the Tri An Reservoir (TAR) as a typical case study of a eutrophic deep reservoir, a detailed assessment of machine learning and satellite band ratio regression approaches was performed to evaluate the predictive ability of diverse models. The aims of this study are as follows: (a) to predict Chl-a concentration using band ratio regression on remote sensing data and machine learning algorithms on water quality parameters, and to compare their results in order to recommend the approach with better performance, and (b) to analyze the spatiotemporal variation and elucidate the mechanism of HCBs in TAR. It is hoped that this work will contribute to an initial assessment of the variability of HCBs in the highly turbid reservoirs of Vietnam.

Methods

Study site description

The Tri An Reservoir is one of the biggest reservoirs in Vietnam, located in Dinh Quan district, Dong Nai province, within an area bounded by 11°05′–11°17′N, 106°58′–107°16′E (Fig. 1). The reservoir is designed for multiple purposes, including drinking and industrial water supply, agricultural irrigation and fisheries, recreation and tourism, flood control, and hydropower operation. Its surface area, maximum depth, mean depth, and volume are 320 km2, 27 m, 8.5 m, and 2.7 billion m3, respectively. The annual mean values of rainfall, air temperature, and wind speed are roughly estimated to be 2400 mm, 33 °C, and 9 m/s, respectively. During the past decade, a high frequency of HCBs has been recorded, dominated by Microcystis and Anabaena colonies with the presence of cyanotoxins (Dao et al. 2016). Based on the TN (0.25–1.3 mg/L) and TP (0.05–0.14 mg/L) concentrations, TAR falls into the eutrophic category.

Fig. 1 Location of the eutrophic Tri An Reservoir and the five monthly sampling stations (yellow triangles)

Water parameters measurement

The data on water quality were collected monthly from five monitoring stations at a 2-m depth from April 2016 to February 2018 (Fig. 1). The samples were preserved on ice in the field until further processing in the laboratory on the same day. Water pH, temperature, and DO were measured in situ with a multi-detector (WTW Multi 3320, Weilheim, Germany), while a Secchi disk was used to determine transparency. To determine the Chl-a content of the water samples, a known volume of raw water (100–300 mL) was filtered through glass-fiber filters (Whatman GF/C, England), and Chl-a was then extracted using 90% acetone overnight in the dark at 4 °C. After centrifugation, the absorbance of the extract was measured at 630–750 nm using a spectrophotometer (UV-VIS, Hach, 500), and Chl-a concentration was calculated using the trichromatic equations (APHA, 2005).
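The trichromatic calculation mentioned above can be illustrated with a short sketch. It assumes the Jeffrey and Humphrey form of the Chl-a equation as given in Standard Methods (absorbance at 664, 647, and 630 nm, each corrected by the 750 nm reading, 1-cm cell); the function name, argument names, and the absorbance readings and volumes in the example are hypothetical and are not taken from this study.

```python
def chl_a_trichromatic(od664, od647, od630, od750,
                       extract_volume_l, sample_volume_l,
                       path_length_cm=1.0):
    """Return Chl-a in ug/L of the original water sample.

    Assumes the trichromatic Chl-a equation for 90% acetone extracts:
    Ca = 11.85*OD664 - 1.54*OD647 - 0.08*OD630 (mg/L in the extract).
    """
    # Subtract the 750 nm reading to correct for turbidity, then
    # normalize to a 1-cm light path.
    a664 = (od664 - od750) / path_length_cm
    a647 = (od647 - od750) / path_length_cm
    a630 = (od630 - od750) / path_length_cm

    # Chl-a concentration in the acetone extract (mg/L).
    ca_extract = 11.85 * a664 - 1.54 * a647 - 0.08 * a630

    # Scale from extract to filtered sample volume and convert mg/L to ug/L.
    return ca_extract * extract_volume_l / sample_volume_l * 1000.0


# Hypothetical readings: 200 mL filtered, 10 mL acetone extract.
chl = chl_a_trichromatic(0.31, 0.11, 0.06, 0.01,
                         extract_volume_l=0.010, sample_volume_l=0.200)
print(round(chl, 2))
```

The turbidity correction and the path-length normalization are kept as explicit steps so that readings from a different cuvette can be reused without changing the coefficients.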

Chemical parameters were analyzed colorimetrically in triplicate with a spectrophotometer (Hach DR/2010) using the following APHA (2005) methods: nitrate 4500-NO₃ (B), phosphate 4500-PO₄³⁻ (B), total Kjeldahl nitrogen 4500-N (C), and total phosphorous 4500-P (D). To measure the total suspended solids (TSS), 300–400 mL of raw water was filtered through a pre-weighed glass-fiber filter, which was then dried completely at 95 ± 5 °C. The TSS concentration was estimated gravimetrically. In addition, the monthly rainfall and wind speed data published by the Southern Regional HydroMeteorological Center (Vietnam) were collected to elucidate the factors driving the high variation in Chl-a concentration and the mechanism of HCBs in TAR.

Image processing

In this study, a band ratio-based model was developed using Sentinel-2A/B and in situ Chl-a data. The Multispectral Instrument (MSI), carried aboard Sentinel-2A (launched on 23 June 2015) and Sentinel-2B (launched on 07 March 2017), is a filter-based push-broom imager that acquires imagery every 5 days. The MSI sensor observes the Earth in 13 spectral bands spread over the VNIR and SWIR domains (443–2190 nm), with spatial resolutions ranging from 10 to 60 m (Gascon et al. 2017). Level-1C images, which are orthorectified, georeferenced, and radiometrically calibrated to Top-Of-Atmosphere (TOA) reflectance, were downloaded from the Sentinels Scientific Data Hub and processed in the Sentinel Application Platform (SNAP) version 6.0 on Windows 10 (64-bit). In particular, a series of 23 cloud-free images (Fig. 2), acquired from November 2015 to February 2019, was used to develop the model and analyze the spatial patterns of Chl-a concentration. Cloud cover was estimated for the whole image, and only images with no cloud above the water body were retained (Fig. 2). After cloud masking and removal using ArcGIS, the cloud-free time series was used in the subsequent processing steps.

Fig. 2 Acquisition date and cloud cover (%) of the Sentinel-2A/B images used

Atmospheric correction was carried out to remove the noise caused by aerosol particles in the atmosphere. There are several atmospheric correction methods, including Sen2Cor, 6SV, ACOLITE, DOS (Dark Object Subtraction), and ATCOR, and the evaluation of the best atmospheric correction is still ongoing in the scientific community (Martins et al. 2017; Chen et al. 2017; Sola et al. 2018). In this study, we employed Sen2Cor to correct atmospheric effects, since it commonly outperforms other methods in highly turbid waters (Grendaitė et al. 2018; Mueller-Wilm et al. 2018; Sola et al. 2018). Furthermore, the ATCOR algorithm-based Sen2Cor processor has recently been updated to improve the accuracy of the surface reflectance derived over water by using the surfaces of the Climate Change Initiative Land Cover (Mueller-Wilm et al. 2018). Hence, Sen2Cor was used to convert TOA reflectance to surface water reflectance (Rw). Then, the Rw values from band 1 to band 7 were used as input for the band ratio model approach (Fig. 3a).

Fig. 3 The surface water reflectance (Rw) calculated by Sen2Cor in the Tri An Reservoir (a). The correlation between reflectance spectra calculated by Sen2Cor and Sentinel-2 Level-2A atmospherically corrected data supplied by the ServHub (b)

No in situ reflectance measurements simultaneous with the Sentinel-2 overpasses were available for the reservoir under investigation. To validate the results obtained by the Sen2Cor method, we therefore used Sentinel-2 Level-2A atmospherically corrected data, which became available from the Open Hub on 2 May 2017 and subsequently from the ServHub (Adriana and Richard 2017). In total, 60 points were randomly extracted from two Level-2A images acquired on Jan. 13, 2019, and Jan. 28, 2019 to validate the robustness of the atmospheric correction in TAR. The correlation between the surface water reflectance calculated by Sen2Cor and that extracted from the Level-2A atmospherically corrected data supplied by the ServHub indicates a very good atmospheric correction for water pixels in TAR (Fig. 3b).

Band ratio regression model development

Several studies have discussed the time gap between in situ measurements and the satellite overpass, indicating a maximum of ± 8 days is reasonable in case of the stationary condition of the water environment (Johnson et al. 2013; Tan et al. 2017; Maeda et al. 2019). A 5-day lag, therefore, was acceptable to develop Chl-a retrieval algorithm in the present study, since the cyanobacterial blooms in TAR have extended for several weeks (based on field observation).
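The match-up rule described above can be sketched as follows: each in situ sampling date is paired with its nearest satellite overpass and kept only if the gap is at most 5 days. The function name and the dates in the example are hypothetical placeholders, not the actual campaign or acquisition dates of this study.

```python
from datetime import date


def match_overpasses(sample_dates, overpass_dates, max_lag_days=5):
    """Pair each sampling date with the nearest overpass within max_lag_days."""
    matches = []
    for sampled in sample_dates:
        # Nearest overpass in absolute time difference.
        nearest = min(overpass_dates, key=lambda o: abs((o - sampled).days))
        lag = abs((nearest - sampled).days)
        if lag <= max_lag_days:
            matches.append((sampled, nearest, lag))
    return matches


# Hypothetical dates for illustration only.
samples = [date(2017, 9, 18), date(2017, 10, 20)]
overpasses = [date(2017, 9, 5), date(2017, 9, 20)]
matched = match_overpasses(samples, overpasses)
print(matched)
```

In this example the first sample is retained with a 2-day lag, while the second is discarded because its nearest overpass is 30 days away.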

To reduce the distortion of the water surface reflectance, the average value within a 3 × 3 pixel box centered on each sampling station was calculated for direct comparison with the in situ measurements (Quang et al. 2017).
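A minimal sketch of the 3 × 3 window averaging is given below. It assumes the reflectance band is held as a list of rows, that the station has already been converted to a (row, column) pixel index, and that masked pixels are stored as None; these conventions and the function name are illustrative assumptions rather than the processing chain used in SNAP or ArcGIS.

```python
def window_mean(band, row, col, half_width=1):
    """Mean reflectance in a (2*half_width+1)^2 box centered on (row, col).

    Pixels outside the image or masked as None are skipped.
    """
    values = []
    for r in range(row - half_width, row + half_width + 1):
        for c in range(col - half_width, col + half_width + 1):
            if 0 <= r < len(band) and 0 <= c < len(band[0]):
                value = band[r][c]
                if value is not None:
                    values.append(value)
    if not values:
        raise ValueError("no valid pixels in the window")
    return sum(values) / len(values)


# Synthetic 5 x 5 band where each pixel value equals row * 5 + col.
band = [[float(r * 5 + c) for c in range(5)] for r in range(5)]
center_mean = window_mean(band, 2, 2)
print(center_mean)
```

Skipping masked and out-of-image pixels keeps the average defined for stations close to the shoreline, at the cost of using fewer than nine pixels there.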

To select the best satellite bands for the band ratio model, the statistical relationship between a color index (i.e., band ratio) and the in situ measurement of Chl-a was tested (Chen et al. 2017; Grendaitė et al. 2018; Mishra et al. 2017). Among the satellite bands, the two- and three-band empirical models using reflectance in the red and near-infrared (NIR) spectral regions have commonly been applied in inland waters. In this study, a wide range of spectral bands from the blue to the NIR regions was examined to develop the linear model for Chl-a retrieval. A total of 22 observations collected within 5 days of a satellite overpass (roughly 32%) were used for model development. The remaining 47 observations (roughly 68%) were used to validate the model’s performance.
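The band ratio regression can be illustrated with an ordinary least-squares fit of Chl-a against a reflectance ratio. In this sketch the ratio values, the Chl-a values, and the resulting slope and intercept are synthetic; they are not the coefficients of any model in Table 1, and the choice of numerator and denominator bands is left to the caller.

```python
def band_ratio(numerator_rw, denominator_rw):
    """Ratio of two surface water reflectance values (e.g., two MSI bands)."""
    return numerator_rw / denominator_rw


def fit_linear(x, y):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic training pairs: band ratio vs. in situ Chl-a (ug/L).
ratios = [0.5, 1.0, 1.5, 2.0]
chl_in_situ = [20.0, 40.0, 60.0, 80.0]
slope, intercept = fit_linear(ratios, chl_in_situ)

# Apply the fitted model to a new pixel with hypothetical reflectances.
predicted = slope * band_ratio(0.06, 0.05) + intercept
print(slope, intercept, predicted)
```

Once fitted on the training match-ups, the same slope and intercept are applied pixel by pixel to map Chl-a over the whole scene.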

Machine learning regression model performance

In total, 117 observations of nine (9) water quality parameters, including total phosphorous (TP), total nitrogen (TN), total suspended solids (TSS), nitrate (NO₃), phosphate (PO₄³⁻), pH, temperature (temp), transparency (trans), and dissolved oxygen (DO), were considered as input data for the machine learning models. Because of the wide variation in water quality values and Chl-a concentration and the differences in measurement units, all the parameters were log-transformed to bring the input data closer to a normal distribution. The inputs for the models were selected through a correlation analysis, retaining the parameters with a Pearson correlation coefficient higher than 0.3. In this research, four machine learning methods and one multivariate linear regression (MLR) model were compared in terms of Chl-a prediction. The machine learning models were the Gaussian Process Regressor (GPR), Random Forest Regressor (RFR), Support Vector Machine Regressor (SVR), and Multi-layer Perceptron Regressor (MLP). Details about the algorithms and their operation can be found at https://scikit-learn.org/stable/ and in the research paper (Pedregosa et al. 2011). The five models were implemented in a Python environment using the scikit-learn library in a two-step process. First, the hyper-parameters of the RFR, SVR, and MLP models were tuned using a grid search with fivefold cross-validation. For the GPR model, various kernels were randomly tested to select the kernel best suited to the dataset in this research. Then, the performance of the models was tested through training and validation phases in a cross-validation with 10 iterations using the Shuffle Split technique for sampling. In each iteration, 60% of the total dataset was used for training and 40% for validation.
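The two-step procedure can be sketched with scikit-learn as follows, using SVR as one example of the tuned models. The data are synthetic stand-ins with the same shape as the dataset (117 observations, five inputs), and the grid values are hypothetical; the grids and kernels actually used are those reported in Table 3. For brevity the grid search here runs on the full synthetic dataset before the Shuffle Split evaluation, which is a simplification.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, ShuffleSplit, cross_val_score
from sklearn.svm import SVR

# Synthetic stand-in for five log-transformed inputs and log10 Chl-a.
rng = np.random.default_rng(0)
X = rng.normal(size=(117, 5))
y = X @ np.array([0.8, 0.3, 0.2, 0.1, 0.05]) + rng.normal(scale=0.1, size=117)

# Step 1: tune hyper-parameters with a grid search and fivefold CV.
param_grid = {"C": [1.0, 10.0], "epsilon": [0.01, 0.1]}  # hypothetical grid
search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5, scoring="r2")
search.fit(X, y)

# Step 2: evaluate the tuned model on 10 random 60/40 train/validation splits.
splitter = ShuffleSplit(n_splits=10, train_size=0.6, test_size=0.4,
                        random_state=0)
scores = cross_val_score(search.best_estimator_, X, y, cv=splitter,
                         scoring="r2")
print(search.best_params_, scores.mean(), scores.std())
```

The mean and standard deviation of the ten scores correspond to the kind of summary used to compare the models, with a smaller spread indicating a more stable model.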

Evaluation criteria

The performance of the Chl-a retrieval models in TAR was evaluated using the following indicators: the coefficient of determination (R2), which measures how much of the variability in the data is reproduced by the model; the root-mean-square error (RMSE) and mean absolute error (MAE), which measure residual errors and give an overall idea of the difference between observation and modeling; the explained variance score (EVS), which scores how much of the variance of the measured values is explained by the modeled values; and Bias, which quantifies the systematic discrepancy between the measured and simulated Chl-a concentration. See Eqs. (1), (2), (3), (4) and (5) below for the formulas.

$$ {R}^2=1-\frac{\sum \limits_{i=1}^N{\left({x}_i^{measured}-{x}_i^{estimated}\right)}^2}{\sum \limits_{i=1}^N{\left({x}_i^{measured}-{x}_{mean}^{measured}\right)}^2} $$
(1)
$$ \mathrm{RMSE}=\sqrt{\frac{\sum \limits_{i=1}^N{\left({x}_i^{estimated}-{x}_i^{measured}\right)}^2}{N}} $$
(2)
$$ \mathrm{MAE}=\frac{1}{N}{\sum}_{i=1}^N\left|{x}_i^{estimated}-{x}_i^{measured}\right| $$
(3)
$$ \mathrm{Bias}=\frac{\sum_{i=1}^N\left({x}_i^{estimated}-{x}_i^{measured}\right)}{N} $$
(4)
$$ \mathrm{EVS}=1-\frac{Var\left({x}_i^{measured}-{x}_i^{estimated}\right)}{Var\left({x}_i^{measured}\right)} $$
(5)
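As an illustration, the five indicators can be computed as in the sketch below. It takes R2 as one minus the ratio of the residual sum of squares to the total sum of squares, MAE as the mean of the absolute residuals (as its name implies), and the variances in EVS as population variances; the function name and the example values are hypothetical.

```python
import math


def evaluation_metrics(measured, estimated):
    """Return R2, RMSE, MAE, Bias, and EVS for paired measured/estimated values."""
    n = len(measured)
    residuals = [m - e for m, e in zip(measured, estimated)]
    mean_measured = sum(measured) / n

    ss_res = sum(r ** 2 for r in residuals)                   # residual sum of squares
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)  # total sum of squares

    mean_residual = sum(residuals) / n
    var_residual = sum((r - mean_residual) ** 2 for r in residuals) / n
    var_measured = ss_tot / n

    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
        "Bias": sum(e - m for m, e in zip(measured, estimated)) / n,
        "EVS": 1.0 - var_residual / var_measured,
    }


# Hypothetical measured and estimated Chl-a values (ug/L).
metrics = evaluation_metrics([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
print(metrics)
```

R2 and EVS differ only when the mean residual (the Bias) is nonzero, which is why both are reported alongside Bias.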

Results

The variation of water quality parameters

The monthly mean and standard deviation of water quality variables from March 2016 to February 2018 in TAR are shown in Fig. 4. The surface water temperature in TAR did not vary much over the sampling period, ranging from 27.1 to 32.5 °C; however, pH varied widely, from 6.0 to 9.0. Transparency exhibited a wide variation from 36 to 191 cm among the five sampling sites, a trend similar to that of TSS (from 0.05 to 3.92 mg/L). The mean dissolved oxygen ranged from 4.4 to 7.2 mg/L. TAR is eutrophic with respect to nitrogen, with higher and wider ranges of nitrate (0.17–0.6 mg/L) and TN (0.47–8.4 mg/L) compared to phosphate (0.04–0.14 mg/L) and TP (0.09–0.57 mg/L), respectively. A wide range of Chl-a (mean value varied from 15.48 to 1310 μg/L) was observed during the sampling period. The highest mean Chl-a concentrations were recorded from May to November and reached several peaks in June (360 μg/L), September (500 μg/L), and November 2016 (1310 μg/L) as heavy blooms occurred (Fig. 5a–c). A repeated cycle began at a low concentration in the dry months (Fig. 5d), with a minimum of 31 μg/L in January 2017 followed by an increase in May 2017 (Fig. 4).

Fig. 4 Mean values of water quality parameters in the Tri An Reservoir (from Mar 2016 to Feb 2018). Data are presented as mean values ± SD

Fig. 5 Heavy bloom of cyanobacteria in June (a), September (b), and November 2016 (c) and water without bloom (d) in the Tri An Reservoir

Algorithms for Chl-a concentration retrieval

Band ratio-based linear regression model

Across all the cross-regression analyses, the linear relationship between Chl-a and the blue-green ratio was weak. Following the same principle as the blue-green ratio model and considering the ratio between reflectance in the near-infrared and red regions, various algorithms were tested for Chl-a retrieval in this study (Table 1). Linear regression-based algorithms are generally preferred for Chl-a retrieval from remote sensing data (Grendaitė et al. 2018; Ha et al. 2013; Ha et al. 2017a; Lins et al. 2017; Ritchie et al. 2003). This means that, among models with linear, quadratic, and exponential functions and similar R2 and RMSE, the linear regression algorithm is the preferred selection. For this reason, model No. 3, with an R2 of 0.72 and an RMSE of 5.95 μg/L (highlighted in italic), was used as the Chl-a prediction model (Figs. 6 and 7). The RMSE of the estimated Chl-a is small compared to the mean in situ Chl-a (which varied from 15.48 to 1310 μg/L), supporting the appropriateness of our model for estimating Chl-a in TAR when the acquisition times differ. Although the ratio B3/B6 vs. Chl-a concentration gave a better fit (R2 > 0.74) in this study, this model was not used for Chl-a retrieval because of its high RMSE (187.03 μg/L).

Table 1 Regression models for chlorophyll-a retrieval (B1 = 443 nm, B2 = 490 nm, B3 = 560 nm, B4 = 665 nm, B5 = 705 nm, B6 = 740 nm, and B7 = 783 nm)
Fig. 6 Scatter plots comparing satellite retrieved and observed Chl-a using 22 training data observations

Fig. 7 Observed and simulated Chlorophyll-a concentration for the training dataset of the prediction model

The performance of our new linear model (No. 3, Table 1) was then validated using 47 in situ measurements of Chl-a concentration collected during field campaigns close in time to satellite overpasses, with a maximum time difference of 5 days (Fig. 8). These 47 observations were different from those used in the training phase (22 observations). This study demonstrated that the green-NIR band ratio model was successful in predicting Chl-a concentration in TAR, with R2 greater than 0.70 and small values of RMSE and Bias. These results attest to our model’s ability to handle the relationship between water surface reflectance and Chl-a concentration.

Fig. 8 Observed and simulated Chlorophyll-a concentration for the validation dataset

Water quality parameters-based machine learning for Chl-a retrieval

Input feature selection

In this paper, feature selection was conducted using correlation analysis to reduce the inputs from nine (9) to five (5) water quality parameters. Although the OOB (Out Of Bag) score of a random forest model is a good option for this selection, as suggested by Li et al. (2018), the correlation analysis between log10 Chl-a and the water parameters was used to create homogeneous inputs for all the compared models (Table 2).

Table 2 Pearson correlation coefficients of transformed water quality parameters and Chl-a concentration. Boldface type indicated high correlation of Chl-a with other factors

The data for TAR showed a high correlation between total nitrogen and Chl-a concentration, with a coefficient of 0.86. Total phosphorous and total suspended solids had a weaker relationship with Chl-a (0.43 and 0.57, respectively) (Table 2). pH and temperature also help to explain the variation of Chl-a, although their contribution is small, as indicated by their low Pearson correlation coefficients. As described in the methodology section, the water quality parameters with a Pearson correlation coefficient > 0.3 were selected. Hence, TP, TSS, TN, pH, and temperature were used as the inputs for the models of Chl-a prediction.
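The selection rule can be sketched as below: every parameter is log10-transformed and retained if its Pearson correlation with log10 Chl-a exceeds 0.3. The sketch assumes that the threshold applies to the signed coefficient, so that negatively correlated parameters are not retained; the function names and the small demonstration table are hypothetical and do not reproduce the values in Table 2.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def select_inputs(table, target="Chl-a", threshold=0.3):
    """Keep parameters whose log10 values correlate with log10 target above threshold."""
    log_target = [math.log10(v) for v in table[target]]
    selected = {}
    for name, values in table.items():
        if name == target:
            continue
        r = pearson([math.log10(v) for v in values], log_target)
        if r > threshold:  # signed comparison: an assumption of this sketch
            selected[name] = r
    return selected


# Synthetic demonstration values only.
demo = {
    "Chl-a": [10.0, 100.0, 1000.0, 10000.0],
    "TN": [1.0, 2.0, 4.0, 8.0],
    "trans": [8.0, 4.0, 2.0, 1.0],
}
selected = select_inputs(demo)
print(selected)
```

In the demonstration, the parameter that rises with Chl-a is kept and the one that falls with it is dropped, which mirrors the treatment of transparency in the selection above.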

Model performance and comparison

The tuned parameters and selected kernel for the machine learning models (Table 3) were kept fixed during cross-validation. Five water quality parameters (TP, TN, TSS, pH, and surface water temperature) were selected as the inputs to train and validate the linear and machine learning models. The best model for Chl-a prediction was defined as the one with (a) the highest values of R2 and EVS and the lowest values of RMSE and MAE (Table 4), and (b) the closest convergence of the training and test scores in the learning curve (Fig. 9). In addition, a model was considered better if it presented a smaller standard deviation of the score in the learning curve.

Table 3 The best parameters and kernel for machine learning models
Table 4 Validation scores for log10 Chl-a prediction from water quality parameters
Fig. 9 Learning curves of machine learning and multivariate linear models. The light red and green areas indicate the standard deviation of the learning curves. A narrower area indicates a better performance of the model. Score is the coefficient of determination of the learning process. The red and green lines are the mean value of the score

As presented in Fig. 9, the learning curves indicated a similar performance of the GPR, MLR, and MLP models, although slight differences were observed. Based on the indicators mentioned above, GPR was identified as the best model for Chl-a prediction from water quality parameters. This model was able to explain more than 79% of the Chl-a variation at the study site with the lowest RMSE (3.06 μg/L), which is considerably lower than the mean Chl-a value of 26 μg/L (Table 4). The stability and superior performance of the GPR model were also confirmed by the small standard deviation and close convergence of the training and cross-validation scores. The MLR model also performed well, with a high R2 and low RMSE (0.79 and 3.09 μg/L, respectively). However, its test score had a wider standard deviation than that of the GPR model (Fig. 9). The training and test scores of the SVR and MLP models also converged well. MLP was stable during the learning process (Fig. 9); however, it was less accurate than the GPR and MLR models. Conversely, RFR was the only model whose training and test scores did not converge during learning, although its coefficient of determination reached 0.72 (Fig. 9).

According to Table 4, all the compared models achieved an acceptable accuracy for Chl-a prediction. The linear and machine learning models were able to explain a wide range of Chl-a variation (roughly 6–4600 μg/L) in TAR, with the highest R2 (0.79) and the lowest RMSE (3.06 μg/L) obtained by the GPR model. This RMSE was considerably lower than the mean value of Chl-a (26 μg/L), which supports the suitability of the model for Chl-a prediction. Nonetheless, all the models shared a somewhat large standard deviation of the validation scores because of the wide range of Chl-a concentrations in TAR. The comparison of model performance also answered the question raised in the introduction: the multivariate linear model still works well when a cause-effect relationship exists between the input (water quality) and output (Chl-a concentration) parameters. The previous analyses illustrated that the variation of Chl-a concentration in TAR can be explained by key factors such as TSS and TN.

Spatiotemporal variation in Chl-a concentration from 2015 to 2019

In total, 23 Sentinel-2A/B scenes acquired under the lowest cloud coverage conditions from late 2015 to early 2019 were processed with the linear model developed above to clarify the spatiotemporal variation in Chl-a concentration. These images were classified into rainy (May–October) and dry (November–April) seasons. To study the spatial distribution in different seasons, Chl-a was retrieved for the water pixels of the Sentinel-2A/B images. Then, the frequency distribution of Chl-a concentration was obtained for the rainy and dry seasons (Fig. 10). In both seasons, Chl-a values varied widely, from 6.46 to 4626.02 μg/L. However, the frequency of Chl-a concentration differed between the two seasons only in the range from 8 to 400 μg/L. Chl-a ranging from 20 to 60 μg/L was the most frequent value in the rainy season, with the highest frequency of 39%. In contrast, the most frequent value of Chl-a in the dry season was around 30 μg/L, with a frequency of 33% (Fig. 10).
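The seasonal grouping and frequency calculation can be sketched as follows. Scenes are assigned to the rainy (May–October) or dry (November–April) season by acquisition month, and the share of pixel values falling into each concentration class is expressed as a percentage. The class edges and pixel values in the example are hypothetical and are not the classes used in Fig. 10.

```python
from datetime import date

RAINY_MONTHS = {5, 6, 7, 8, 9, 10}  # May to October


def season_of(acquired):
    """Classify an acquisition date as 'rainy' or 'dry'."""
    return "rainy" if acquired.month in RAINY_MONTHS else "dry"


def frequency_by_class(values, edges):
    """Percentage of values in each [low, high) class defined by edges.

    Values outside all classes still count toward the total.
    """
    classes = list(zip(edges[:-1], edges[1:]))
    counts = {cls: 0 for cls in classes}
    for value in values:
        for low, high in classes:
            if low <= value < high:
                counts[(low, high)] += 1
                break
    total = len(values)
    return {cls: 100.0 * count / total for cls, count in counts.items()}


# Hypothetical pixel Chl-a values (ug/L) from one scene.
pixels = [10.0, 25.0, 30.0, 55.0, 500.0]
freq = frequency_by_class(pixels, [0, 20, 60, 5000])
print(season_of(date(2017, 9, 5)), freq)
```

Pooling the pixel values of all scenes in a season before calling the frequency function gives one distribution per season, which is the form in which the two seasons are compared.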

Fig. 10 Frequency distribution of Chl-a in the Tri An Reservoir by dry and rainy season

The distribution maps showing the specific regions of high Chl-a concentration in TAR in the rainy and dry seasons are presented in Figs. 11 and 12, respectively. Due to high cloud coverage, there were not enough satellite images for a continuous, long-term observation. It is, therefore, a challenge to generalize the overall variation of Chl-a in TAR. In the maps of 10 scenes in the rainy season and 13 scenes in the dry season observed from 2015 to 2019, the patterns of Chl-a distribution were not homogeneous but a mixture of areas of high and low Chl-a, due to the influences of meteorological and hydrological parameters. In both seasons, most areas of TAR suffered from HCBs with a high Chl-a concentration, ranging from 20 to below 5000 μg/L. The blooms were directly observed during the field campaigns on April 23, 2016 and September 20, 2016, which is consistent with the patterns in Figs. 11 and 12. Algae tend to bloom more strongly in the south and southwest areas, which are the downstream part of TAR. It is worth noting that two major rivers located near station TA5 (Fig. 1) discharge water into the reservoir, especially after heavy rains, which implies that HCBs usually occur downstream rather than in the north of the study area. In the rainy season, Chl-a concentration tends to be higher from the early days of June to the end of the season (i.e., October), and these values are specific to the dry season. Conversely, higher values were observed in the early days of the dry season (November to early February), and Chl-a values began declining in the following months (from March to June). The most serious blooms first occurred in the southern or central part of the reservoir and gradually moved toward the southwest and northwest areas (image pair of September 5, 2017 vs. September 20, 2017). The higher values were in the southern and southwestern parts, while the lower values were in the southeast and northeast areas of the reservoir. During the years from 2015 to 2019, the maximum bloom area occupied more than 50% of the total area.

Fig. 11 Spatiotemporal distribution of HCBs in the Tri An Reservoir in rainy season

Fig. 12 Spatiotemporal distribution of HCBs in the Tri An Reservoir in dry season

With an R2 of 0.72, the linear model explains most of the general variation of Chl-a in TAR in both the dry and rainy seasons. Nonetheless, a small area with a very high Chl-a concentration (over 5000 μg/L) was recorded in the narrow corner of the reservoir. These values may arise from very high turbidity and/or strong effects of bottom reflectance in shallow waters, leading to anomalous values of the surface water reflectance in bands 3 and 6 of Sentinel-2 imagery.

Discussion

Similarity to other studies

The optimal band positions determined for the Chl-a estimation algorithm in this study are consistent with the findings of other studies. Spectral bands beyond 650 nm are appropriate for the development of Chl-a retrieval models for inland waters, particularly in case 2 waters where Chl-a concentration is above 10 μg/L (Richardson and LeDrew 2006; Le et al. 2009; Mishra et al. 2017). In particular, for large variations in Chl-a concentration, Zimba and Gitelson (2006) showed that the wavelengths of 650 nm, 710 nm, and 740 nm were the optimal choice. The specific wavelength varies with water constituents and their optical properties (Le et al. 2009). These findings indicate that the red and near-infrared wavelengths are the appropriate regions for the development of Chl-a retrieval algorithms in TAR.

To our knowledge, this is the third study using remote sensing to estimate Chl-a concentration in the lakes and reservoirs of Vietnam and the first in the southern area. Our model (R2 of 0.72) outperformed the model applied in Ba Be lake, which had an R2 of 0.68. It is worth noting that TAR exhibited a large variation in Chl-a (mean value from 15.48 to 1310 μg/L), compared to a very small range from 1.58 to 6 μg/L in the case of Ba Be (Ha et al. 2017b). The other study used a Landsat 8/OLI two-band ratio algorithm for Chl-a concentration mapping in West Lake in Hanoi, with a range from 42 to 258 μg/L (Ha et al. 2017a). However, that study used an exponential function for Chl-a retrieval with an achieved R2 from 0.64 to 0.82, which may result in uncertainties compared to the linear regression method applied in our paper (Grendaitė et al. 2018; Ha et al. 2013; Lins et al. 2017; Pham et al. 2019; Quang et al. 2017).

Machine learning with other case studies

The results of this study also support the practicality of using machine learning models for the retrieval of bio-optical parameters. Compared to other research papers, additional results were found in the case study of TAR. For log-transformed data, a high correlation coefficient with Chl-a was detected for TN (0.86) and TSS (0.57), which contrasts with various observed datasets (Li et al. 2018; Lou et al. 2016). In addition, the GPR model outperformed the other competing models and provided a novel solution to the task of Chl-a prediction in the freshwater environment. Compared to the application of random forest (Li et al. 2018), extreme learning (Lou et al. 2016), ANN and support vector machine (Park et al. 2015; Xie et al. 2012), and the M5P model tree, and despite a lower value of R2 (Yi et al. 2018b), GPR still shows very good RMSE and MAE values in the case study of TAR. A similarly strong performance of the GPR model was also identified for biophysical parameter retrieval, particularly for oceanic Chl-a estimation (Blix and Eltoft 2018a; Blix and Eltoft 2018b; Verrelst et al. 2012). More interestingly, the multivariable linear model proved to be a useful predictor of Chl-a variation when a cause-effect relationship exists between the inputs (water quality parameters) and the output (Chl-a concentration). This performance is worthy of attention, mainly because of the simplicity and low computational cost of the model compared to the machine learning approaches.

In comparison with the linear model for Chl-a retrieval from remotely sensed data, the machine learning approach using water quality data may not be an optimal method for monitoring HCBs in TAR, in spite of its higher R2 and lower RMSE compared to the band ratio regression models. This issue stems from the practical limitations of machine learning models, which require substantial amounts of input data, rely on complex algorithms with many parameters, and make it difficult to interpret results precisely. In addition, the unavailability of water quality data may further complicate model validation and prediction for early warnings of HCBs. As a result, our proposed linear model may be regarded as an optimal choice for the further monitoring of HCBs in inland waters of Vietnam.

Mechanisms of HCBs in TAR

Blooms of harmful cyanobacteria in inland waters are influenced by multiple factors, including, but not limited to, light, temperature, turbidity, precipitation, wind speed, water residence time, and nutrient composition (Mu et al. 2019; Paerl 2017). In Lake Vancouver (Canada), phosphate was found to be the key factor regulating HCBs and toxin concentrations (Lee et al. 2015), whereas high nutrients, low water clarity, and warmer surface temperatures were identified as the three most influential environmental factors correlated with cyanobacterial composition in US lakes and reservoirs (Beaver et al. 2018). High water temperatures are known to trigger the development of cyanobacterial blooms in temperate zones (Imai et al. 2008; Kosten et al. 2012), while the occurrence of HCBs in tropical areas has been reported to be regulated by many factors (Bui et al. 2017; Pham et al. 2017).

In TAR, Chl-a concentration was strongly correlated with TN, TSS, and, to a lesser extent, TP; however, it was negatively correlated with transparency. It is difficult to draw strong conclusions about bloom formation. However, an overall trend is evident: the rainy months and the early dry months (from May to November) tend to correspond to the highest bloom frequencies. The discharge of nitrogen-rich water from the TAR catchment area, driven by the intensification of agriculture and land runoff in the rainy season, is very likely to create favorable conditions for the development of HCBs. This suggests that cyanobacterial blooms in TAR can be attributed to the high nutrient concentrations and low transparency of the rainy season.

The relationships between the monthly mean Chl-a concentration and climatic parameters measured at the Tri An station are shown in Fig. 13. In general, heavy rain leads to a temporary increase in nutrients, which is beneficial for algal blooms (Mu et al. 2019). The results show that monthly precipitation exhibited a statistically significant positive correlation with average monthly Chl-a concentration (R2 = 0.54, p < 0.01). This suggests that monthly precipitation has a strong short-term effect on blooms in TAR, which is consistent with the observations of Mu et al. (2019) in Dianchi Lake, China. After rainfall, the monthly mean solar irradiance is the next most important contributor to HCBs in TAR (R2 = 0.36, p < 0.01). In contrast, wind speed and air temperature have less impact on HCBs, with R2 of 0.16 and 0.13, respectively (Fig. 13).

Fig. 13 Correlation between Chl-a concentration and climatic parameters
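The per-parameter R2 values discussed for Fig. 13 are the kind of quantity obtained by regressing monthly mean Chl-a on one climatic variable at a time. A minimal sketch of that calculation follows, assuming a simple one-predictor linear relationship; the monthly values are invented for illustration and are not the Tri An station records, and the significance test behind the reported p-values is not reproduced here.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical monthly means; illustrative values only, not station data.
chl_a = [18.0, 22.0, 35.0, 41.0, 30.0, 25.0]  # ug/L
drivers = {
    "precipitation_mm": [40.0, 90.0, 210.0, 260.0, 180.0, 120.0],
    "wind_speed_m_s": [2.1, 1.8, 2.4, 1.9, 2.2, 2.0],
}

# For a one-predictor linear fit, R2 equals the squared Pearson r.
r2 = {name: pearson_r(values, chl_a) ** 2 for name, values in drivers.items()}

# List the drivers from strongest to weakest association with Chl-a.
for name, value in sorted(r2.items(), key=lambda item: item[1], reverse=True):
    print(name, round(value, 2))
```

Ranking the drivers by R2 in this way mirrors the ordering described in the text, although with correlated drivers (for example, rainfall and cloud cover affecting irradiance) single-variable R2 values should be read as associations, not as independent contributions.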

Conclusions

To the best of our knowledge, this study is the first attempt to estimate Chl-a concentration in TAR using both Sentinel-2A/B data and water quality data collected from field campaigns. We constructed models using both approaches, namely traditional band ratio regression with simulated Sentinel-2A/B data and state-of-the-art machine learning with water quality data, to estimate and map Chl-a in a eutrophic tropical reservoir. The NIR and green band ratio, with an R2 of 0.72 and an RMSE of 5.95 μg/L, was identified as the optimal model for quantifying Chl-a in such tropical inland waters. Both methods show satisfactory performance for Chl-a retrieval (R2 > 0.7). Specifically, GPR yields better results than the other machine learning models, with the highest R2 of 0.79 and the lowest RMSE of 3.06 μg/L.

The results illustrate the benefit of using machine learning models when it is hard to obtain cloud-free satellite images, particularly in tropical regions. Band ratio regression and machine learning models can support each other and be applied to water quality and environmental management studies. Future studies are advised to repeat our sampling strategy to validate the model and to link remote sensing reflectance data to machine learning algorithms.

In addition, the factors influencing the occurrence of HCBs were investigated using monthly water quality data to elucidate their spatiotemporal variation. High TN and TSS in the rainy season were considered the driving factors of HCBs, with Chl-a concentration greater than 20 μg/L in most areas of TAR. Regarding climatic parameters, Chl-a concentration was high mainly due to the large amount of precipitation entering the study area, followed by solar irradiance, while air temperature and wind speed made small contributions to HCBs.

The early prediction of HCBs is necessary to support a healthy practice of water usage in the community. The promising results of this study offer various approaches to the task of monitoring HCBs using satellite-based modeling or machine learning with water quality data. However, longitudinal cyanobacterial data collection is recommended for an accurate prediction of HCBs in TAR.