Introduction

Eutrophication is a worldwide environmental problem, originating from the excessive accumulation of nutrients in water bodies. The problem has drawn significant attention since the 1970s (Bouwman et al. 2002; Smith 2003); however, the issue emerged much later in China, along with the country’s economic growth in the 1990s (Le et al. 2010). The situation has worsened in the new century, with sudden algal blooms breaking out in several important lakes and threatening ecological safety and human health (Guo 2007; Goreau 2008). To face this challenge, the Chinese government proposed a national project in 2006, called the Major Science and Technology Program for Water Pollution Control and Treatment. The program was funded at a level of more than 12 billion RMB. Several rivers, lakes, and reservoirs were selected, with the goal of obtaining specific restoration experience in basins with different natural characteristics and socioeconomic conditions.

Of the studied water bodies, Lake Erhai was selected as a representative preliminary eutrophic lake. Agricultural non-point pollution was identified as the main pollution source (Shang and Kong 2014). With the rapid increase in population and economic development, the water quality in Lake Erhai rapidly deteriorated after the 1990s. In 1996 and 2003, two severe algal blooms broke out in part of the lake. After two consecutive “Five-Year Programs” involving integrated remediation, the water quality of Lake Erhai has improved significantly in the last decade. In 2015, its water quality was graded as class II to III, according to the Water Environment Quality Standard of China (GB3838-2002). This means it meets the standard for a drinking water source (Wang et al. 2015).

However, protecting preliminary eutrophic lakes in developing countries often leads to conflicts with local economic development. The eutrophication process may accelerate during high-speed economic development. In the Lake Erhai basin, the local gross domestic product (GDP) growth rate in 2015 was 8.26%, which was faster than the average level in China (6.9%) and in the USA (2.6%). Thus, there remains a risk of future water quality deterioration. Moreover, both the positive and negative experiences in the management of the lake over the past 10 years should be carefully considered. Understanding the variations in water quality in Lake Erhai can help inform reasonable eutrophication control policies and environmentally friendly economic development strategies.

Until now, trends and variations in water quality for most lakes, especially in developing countries, have been demonstrated directly using raw monitoring data. The periodicity and stochasticity of variations in water quality are generally not accounted for, which hinders accurate assessment of a lake’s eutrophication state. Seasonal Trend Decomposition using Loess (STL) (Cleveland et al. 1990) has been demonstrated to be a reliable tool for analyzing long-term trends and variations (Sellinger et al. 2007; Shamsudduha et al. 2009; Wang et al. 2014a, b). STL uses a locally weighted running-line smoother to fit data, which can address nonlinear and local trends. Qian et al. (2000) and Stow et al. (2015) analyzed long-term and seasonal trends of nutrient concentrations and loads in two rivers in the USA, respectively. Their results indicate that the STL method can exclude stochastic fluctuations in raw data and can accurately decompose water quality data into long-term and seasonal trends. With the development of different modeling tools, the modeling of variations in water quality has drawn increased attention. Of the well-developed techniques, the artificial neural network (ANN) has been shown to be a useful water quality modeling tool (Chebud et al. 2012; Khalil and Adamowski 2014). With an ANN, the complicated, laborious, expensive, and time-consuming process of generating water quality data can be reduced by using more easily collected hydraulic data (Chang et al. 2015; Tomić et al. 2016). However, previous studies have generally used raw monitoring data to forecast water quality with ANNs (Park et al. 2015; Xu et al. 2005). To date, no work has modeled water quality using an ANN that takes STL-decomposed data as input.

This study analyzed Lake Erhai’s nutrient concentration data from 2006 to 2014, the longest time span seen in the literature on this topic (Yin et al. 2011; Jiao et al. 2013). STL was applied to analyze long-term trends and seasonal variations of TN and TP concentrations of 11 stations in Lake Erhai. Sensitive water quality parameters and sensitive sampling sites were selected using STL, revealing different long-term variations and patterns compared to other indices. Finally, the most sensitive parameter in the sensitive sampling sites was modeled using an ANN method. The study established a new approach to better understand and estimate trends and variations in nutrient levels, using more easily accessible physical parameters in a preliminary eutrophic lake.

Methods

Study area and data collection

Lake Erhai is in southwest China (25°25′–26°16′N, 99°32′–100°27′E) and is the seventh largest freshwater body in China. Its surface area covers approximately 250 km², with an average depth of 10.2 m and a storage volume of 28.2 × 10⁸ m³ (Zhang et al. 2015). The northern part of the basin is in an agricultural region. Three primary rivers (Miju River, Yongan River, Luoshi River) flow into the lake in that area, carrying many agricultural pollutants (Wang et al. 2013). The western region of the basin consists of a narrow plain between Cang Mountain and Lake Erhai, where both agriculture and tourism are highly developed. Eighteen rivers, called the Cangshan Eighteen Creek, flow into the west part of the lake. The east side of Lake Erhai consists of sharp embankments. The southern part of the basin contains downtown Dali City; there is significant domestic and tourism pollution in this region (Fig. 1).

Fig. 1

Map of Lake Erhai and the location of inflow rivers (Shang et al. 2012)

Figure 1 shows the distribution of sampling sites. Each water sample was a mixture of water collected at 0.5 m below the water surface and water collected from the bottom of the lake at 0.5 m above the sediment. Water samples were collected twice a month (on the 1st and 15th) from 2006 to 2014. Table 1 displays the preliminary statistics of the water quality data and the data processing.

Table 1 Basic statistics of water quality data obtained from 11 monitoring stations along Lake Erhai (January 2006–December 2014)

The analysis scheme

This study developed a systematic analysis scheme to assess spatial-temporal water quality using artificial intelligence and statistical techniques (Fig. 2). An STL analysis was conducted to identify the sensitive water quality parameter and the sensitive area. In our research, the sensitive water quality parameter was defined as the water pollutant whose concentration changed the most. The sensitive area was defined as the area where the sensitive water quality parameter changed the most. The nonlinear autoregressive model with exogenous inputs (NARX) dynamic ANN was used to model the concentration of the sensitive water quality parameter in the sensitive area, and the input parameters were selected using correlation analysis. The main methods are described as follows.

Fig. 2

Study flow of the proposed systematic analysis scheme

STL

A time series of monthly environmental monitoring data at a selected location is defined as the sum of a high-frequency seasonal component, a low-frequency long-term component (or trend component), and a remainder. Each individual observation is decomposed as

$$ {\mathrm{Y}}_{\mathrm{year},\mathrm{month}}={\mathrm{T}}_{\mathrm{year},\mathrm{month}}+{\mathrm{S}}_{\mathrm{year},\mathrm{month}}+{\mathrm{R}}_{\mathrm{year},\mathrm{month}} $$
(1)

In this expression, Y_{year,month} is the observed value for a given year and month; T_{year,month} is the low-frequency variation in the data, together with non-stationary, long-term changes in the levels. S_{year,month} is defined as the variation in the data at or near the seasonal frequency. The term R_{year,month} is the remaining variation in the data beyond the seasonal and trend components. The STL model applies one continuous loess line for the smoothed, long-term component, and 12 month-specific loess lines for the seasonal component. We chose window widths of 12 months and 48 months for the seasonal and long-term components, respectively. Periodic time windows were selected to reveal trends. A detailed description of the operational processes and parameter settings, including the method of choosing smoothing parameters, is provided in Cleveland et al. (1990).
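As a rough illustration of the additive model in Eq. 1, the following Python sketch splits a monthly series into trend, seasonal, and remainder parts. It is not STL: it swaps the loess smoothers for a classical centered moving average and month-wise means, and the function names (`moving_average_trend`, `decompose`) and the synthetic series are assumptions made for illustration only, not data or code from this study.

```python
import math

def moving_average_trend(y, window=12):
    """Centered moving average as a crude stand-in for the loess trend.
    Even windows use half weights on the two end points (2 x window form).
    Positions where the window does not fit are returned as None."""
    n = len(y)
    half = window // 2
    trend = [None] * n
    for t in range(half, n - half):
        if window % 2 == 0:
            s = (0.5 * y[t - half]
                 + sum(y[t - half + 1 : t + half])
                 + 0.5 * y[t + half])
        else:
            s = sum(y[t - half : t + half + 1])
        trend[t] = s / window
    return trend

def decompose(y, period=12):
    """Return (trend, seasonal, remainder) with Y = T + S + R wherever T exists."""
    trend = moving_average_trend(y, period)
    # Collect detrended values by position within the seasonal cycle.
    buckets = [[] for _ in range(period)]
    for t, (obs, tr) in enumerate(zip(y, trend)):
        if tr is not None:
            buckets[t % period].append(obs - tr)
    raw = [sum(b) / len(b) for b in buckets]
    centre = sum(raw) / period  # force the seasonal cycle to sum to zero
    cycle = [v - centre for v in raw]
    seasonal = [cycle[t % period] for t in range(len(y))]
    remainder = [None if tr is None else obs - tr - s
                 for obs, tr, s in zip(y, trend, seasonal)]
    return trend, seasonal, remainder

# Synthetic monthly series (hypothetical values): linear trend plus annual cycle.
y = [0.5 + 0.001 * t + 0.1 * math.sin(2 * math.pi * t / 12) for t in range(108)]
trend, seasonal, remainder = decompose(y)
```

On this noise-free series the moving average recovers the linear trend and the remainder is numerically zero; real monitoring data would leave a non-trivial remainder, and loess smoothing as in STL would follow local curvature more flexibly.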

Correlation analysis

Selecting input parameters for the water quality estimation model is an important part of the process. A larger number of input parameters can improve prediction accuracy; however, excess input data can introduce redundancy and hinder the simulation (Chang et al. 2015). Correlation analysis can reveal the relationships between water quality parameters and can therefore be used to select the input parameters for the ANN model. This study used a Pearson correlation test to select the input factors. Unlike other research, the correlation analysis used the trend data (decomposed by STL) instead of the observation data.
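The selection step can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the helper names (`pearson_r`, `rank_inputs`), the cutoff `min_abs_r`, and all numbers are hypothetical, and a fixed cutoff on |r| stands in for the significance test used in the study.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def rank_inputs(target, candidates, min_abs_r=0.5):
    """Keep candidates whose |r| with the target reaches the (assumed) cutoff,
    ordered from strongest to weakest correlation."""
    scores = {name: pearson_r(series, target)
              for name, series in candidates.items()}
    kept = [name for name, r in scores.items() if abs(r) >= min_abs_r]
    kept.sort(key=lambda name: abs(scores[name]), reverse=True)
    return kept, scores

# Hypothetical trend series (not values from this study).
tp_trend = [0.020, 0.022, 0.025, 0.027, 0.030, 0.033]
candidates = {
    "DO": [7.2, 7.1, 6.9, 6.8, 6.6, 6.5],
    "pH": [8.4, 8.5, 8.5, 8.6, 8.7, 8.7],
    "unrelated": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
kept, scores = rank_inputs(tp_trend, candidates)
```

Applying the same function to raw observations and to STL trend components would show whether smoothing changes which candidates pass the cutoff.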

NARX dynamic ANN

The NARX dynamic ANN is one of the most widely used dynamic ANNs. It retains the output of a previous step and feeds it into the next step, giving it a dynamic feature while also keeping complete information.

The NARX network contains input layers, hidden layers, and output layers. Recurrent connections from the output may be delayed by several time units to form new inputs. The mathematical structure of the NARX network is given by the following equation (Chang et al. 2015):

$$ z(t)=f\left[z\left(t-1\right),...,z\left(t- dz\right);U(t)\right] $$

In this expression, U(t) and z(t) denote the input vector and output value of the model at a discrete time step t, respectively, and dz is the number of lagged outputs fed back as inputs. The term f(·) indicates the nonlinear function that needs to be approximated by a learning algorithm.

The NARX network can be trained by two modes. The first mode is the series-parallel (SP) mode, where the output’s regressor in the input layer is formed only by actual values of the system’s output, d(t):

$$ z(t)=f\left[d\left(t-1\right),...,d\left(t- dz\right);U(t)\right] $$

The second mode is the parallel (P) mode, where estimated outputs are fed back into the output’s regressor in the input layer; it is represented mathematically by the general NARX equation given above, in which z(t) depends on its own lagged estimates. The NARX network can be trained in the SP mode to construct the relationship between actual and estimated values of the target variable. The constructed NARX network is then applied in the P mode to the unrecorded period, improving estimation performance with the recurrent information (the estimated values derived from the model).
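The difference between the two modes can be illustrated with a short Python sketch. The function `f` below is a hypothetical fixed linear map standing in for the trained network, and the names `series_parallel` and `parallel` as well as all numbers are assumptions for illustration; only the way the output regressor is filled reflects the two equations above.

```python
def f(lagged_outputs, u):
    """Hypothetical stand-in for the trained nonlinear network f(.)."""
    return 0.6 * lagged_outputs[0] + 0.2 * lagged_outputs[1] + 0.1 * u

def series_parallel(d, U, dz=2):
    """SP mode: the regressor holds observed outputs d(t-1), ..., d(t-dz)."""
    z = [None] * len(d)
    for t in range(dz, len(d)):
        lags = [d[t - k] for k in range(1, dz + 1)]
        z[t] = f(lags, U[t])
    return z

def parallel(d_init, U, dz=2):
    """P mode: seeded with dz observed values, then fed its own estimates
    z(t-1), ..., z(t-dz), so it can run where no observations exist."""
    z = list(d_init[:dz])
    for t in range(dz, len(U)):
        lags = [z[t - k] for k in range(1, dz + 1)]
        z.append(f(lags, U[t]))
    return z

# Hypothetical observed outputs d and exogenous input U.
d = [1.0, 1.0, 1.2, 0.8, 1.1, 0.9]
U = [1.0] * 6
z_sp = series_parallel(d, U)
z_p = parallel(d, U)
```

The two modes agree at the first estimated step (both see only observed lags) and diverge afterwards, because the P mode propagates its own earlier estimates instead of the observations.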

The performance of the NARX dynamic ANN model is evaluated using common measures of goodness of fit: RMSE (root-mean-square error) and R (the correlation coefficient between the estimated value and observed value). An appropriate estimation model minimizes RMSE and maximizes R between observed and modeled values; both measures are defined as follows (Young et al. 2015):

$$ R=\frac{\left(1/N\right){\sum}_{i=1}^N\left[{\left({Y}_m\right)}_i-\overline{Y_m}\right]\left[{\left({Y}_0\right)}_i-\overline{Y_0}\right]}{\sqrt{\left(1/N\right){\sum}_{i=1}^N{\left[{\left({Y}_m\right)}_i-\overline{Y_m}\right]}^2}\times \sqrt{\left(1/N\right){\sum}_{i=1}^N{\left[{\left({Y}_0\right)}_i-\overline{Y_0}\right]}^2}} $$
$$ \mathrm{RMSE}=\sqrt{\left(1/N\right)\sum \limits_{i=1}^N{\left[{\left({Y}_m\right)}_i-{\left({Y}_0\right)}_i\right]}^2} $$
$$ \overline{Y_m}=\left(\sum \limits_{i=1}^N{\left({Y}_m\right)}_i\right)/N $$
$$ \overline{Y_0}=\left(\sum \limits_{i=1}^N{\left({Y}_0\right)}_i\right)/N $$

In these expressions, Y_m is the estimated value, Y_0 is the observed value, and N is the number of data points.
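The two measures translate directly into code. The sketch below follows the formulas above term by term; the function names and the four-point example are assumptions chosen only to make the arithmetic easy to check.

```python
import math

def rmse(y_m, y_0):
    """Root-mean-square error between estimated (y_m) and observed (y_0) values."""
    n = len(y_m)
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(y_m, y_0)) / n)

def corr(y_m, y_0):
    """Correlation coefficient R between estimated and observed values."""
    n = len(y_m)
    mean_m = sum(y_m) / n
    mean_0 = sum(y_0) / n
    cov = sum((m - mean_m) * (o - mean_0) for m, o in zip(y_m, y_0)) / n
    sd_m = math.sqrt(sum((m - mean_m) ** 2 for m in y_m) / n)
    sd_0 = math.sqrt(sum((o - mean_0) ** 2 for o in y_0) / n)
    return cov / (sd_m * sd_0)

# Hypothetical example: estimates that are uniformly 0.5 too high.
observed = [1.0, 2.0, 3.0, 4.0]
estimated = [1.5, 2.5, 3.5, 4.5]
example_rmse = rmse(estimated, observed)  # 0.5
example_r = corr(estimated, observed)     # 1.0
```

The example shows why the two measures are reported together: a constant bias leaves R at 1 while RMSE still registers the error.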

Statistical analysis

For nitrogen long-term data, significant differences among stations and years were evaluated through a parametric one-way analysis of variance (ANOVA). Significant (p < 0.05) differences were detected by a multiple Tukey comparison test. Statistical analyses were performed using the commercial software SPSS version 19.0 (SPSS, Inc., Chicago, IL, USA) (Fig. 3).
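For readers unfamiliar with the test, the F statistic of a one-way ANOVA can be computed as in the following sketch. The function name `one_way_anova_f` and the three small groups are hypothetical and are not data from Lake Erhai; the p-value lookup and the Tukey comparison that followed in SPSS are not reproduced here.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g)
                    for g, m in zip(groups, means))
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical concentrations at three stations.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
```

Here the between-group sum of squares is 26 and the within-group sum of squares is 6, giving F = 13 on (2, 6) degrees of freedom.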

Fig. 3

Spatial patterns of the long-term trends in TN and TP concentrations (in mg/L). Each column represents one sampling station

Results and discussion

Long-term trends in nutrient concentrations at the sampling sites

Long-term trends in total nitrogen (TN) concentrations appear almost the same across all sites. Between 2006 and 2014, TN concentrations trended progressively downward, with the exception of a sharp fluctuation in 2007–2008. The decreasing tendency of the TN concentration reflects the impacts of governance during these years. Starting in 2003, the local government initiated a water environment remediation project for three main inflow rivers in the northern basin (Miju River, Yongan River, Luoshi River). At the same time, estuarine wetlands for these three inflow rivers and a 3.3-km² lakeside zone were constructed. These measures effectively reduced the TN load from the inflow rivers and agriculture production near the lake. Several decentralized waste treatment systems were also built to control rural domestic sewage from nearby villages. According to a survey, by 2010, there were 45 soil purification tanks in the basin (Zhang et al. 2013).

In 2007–2008, the TN concentrations at all sampling sites reached their minimum values. The values then increased to a new higher value that remained lower than levels before 2007. This outcome was likely due to the larger rainfall amount in 2007 (931 mm), which was higher than the multiyear average value (673 mm). The sudden change in river flow may have led to significant improvements in river water quality and impacts on ecosystems (Chang et al. 2015). This may have led to a decline in the TN concentration.

For the spatial comparison, TN concentrations at site 1 were significantly higher than at other sites (Table S1; P < 0.05). This site was located near the northern lakeshore of Lake Erhai, a traditional agricultural and livestock-raising area. This area includes 58% of the total farmland and 70% of the total livestock in the basin (Zhang et al. 2014; Lu et al. 2015). As a result, the northern region contributed the most agricultural non-point pollution in the whole basin. It was estimated that the three main inflow rivers (Miju River, Yongan River, Luoshi River) in the northern basin contributed approximately 47% of the TN load into the lake (Yan et al. 2005).

Because of the many restoration measures listed above, the TN concentration at the north stations (stations 1, 2, and 3) declined significantly between 2009 and 2014 (p < 0.05; Table S2). While TN trends were almost the same across most of the sites, they differed at sites 4 and 5. Unlike other sites, the TN concentrations at sites 4 and 5 remained stable, rather than declining after 2010. These sites were near the northwest lakeshore of Lake Erhai, near a large farming village (Xizhou Town). In this area, rainfall, farmland irrigation water, and rural domestic wastewater flow into the lake through rivers or overland flow. This creates a large nutrient load. Our results indicated that while much has been done, the measures taken to prevent agricultural non-point pollution in the north-western region have not been sufficient. Stricter agricultural non-point pollution control policies should be implemented in this area in the future.

The TP concentrations at sites 1–5, which are in an agriculture and livestock region, progressively increased from 2006 to 2013. Concentrations then decreased in 2014. Studies showed that in the Lake Erhai basin, increased livestock and rural wastewater were the main sources of TP, creating 60% and 19% of the total TP load, respectively (Shang and Kong 2014; Wang et al. 2014a, b). From 2011 to 2014, livestock breeding in Lake Erhai basin developed more rapidly than previously. In 2014, meat and milk production increased 9.2% and 22%, respectively, compared to 2011 levels. However, livestock wastewater and animal dung were not well treated. Livestock wastewater was discharged into the lake without treatment, and the animal dung was habitually stacked on the roadside. Nutrient elements were then brought into the lake through rivers or overland runoff.

The inadequate governance of livestock pollution appears, therefore, to have been the dominant factor driving the increased TP before 2014. However, since 2013, there has been an increased exploration of livestock waste collection and recycling. In 2014, three livestock waste collection points were put into use, leading to the collection of 120,000 t of livestock waste. As a result, TP concentrations have declined since then; however, the TP concentrations at sites 9, 10, and 11 were higher than other sites (Table S1). This may be related to the geographical position of these sites. Sites 9, 10, and 11 are located in the southern part of Lake Erhai, which adjoins downtown Dali City. The tail water from the Xiaguan municipal sewage plant is the main source of the TP load in this region. The modeling indicates that in downtown Dali City, the treatment ratio of urban sewage was 93.65% and the TP load of the tail water was 48 t/a (Bai et al. 2015). This may help explain the higher TP concentration in this area.

Unlike other sites, the TP concentration at site 6 rose consistently between 2006 and 2014. Site 6 was located near the west lakeshore of Lake Erhai, next to Dali Ancient Town, a famous scenic spot. In recent years, as tourism has developed, the number of tourists in Dali has increased rapidly (from 6 million in 2010 to 9 million in 2014). This increase of tourists may have led to the increased nitrogen and phosphorus in the water (Thebault and Qotbi 1999; Karydis and Kitsiou 2012). Thus, the government should promote ecological tourism and strengthen wastewater treatment in this area.

Seasonal patterns of nutrient concentrations at the monitoring stations

A previous study showed that, in terms of spatial distribution, the west part of Lake Erhai holds the largest shares of TN and TP, at 51% and 60%, respectively (Bai et al. 2015). This is because of the higher population density, more developed agriculture, and growth of the livestock industry. In this study, the long-term TN concentrations at stations in the west part of Lake Erhai (stations 1, 4, 6) were higher than those at stations in the center and east part of the lake (Table S1). Therefore, we chose stations 1, 4, 6, and 9 for the seasonal pattern analysis, which was based on the sum of the long-term trend and seasonal components at these sites.

In our research, TN concentration trends followed a regular seasonal pattern, with the highest value occurring from July to October (Fig. 4; Table S3). This seasonal pattern may relate to the rainfall pattern and crop rotation system in the Lake Erhai basin. The basin has a typical low-latitude plateau climate of the subtropical southwest monsoon type (Tang et al. 2012). The year alternates distinctly between dry and rainy seasons. Historical data indicate that 85–96% of the rainfall is concentrated between May and October, with a monthly peak of 217 mm in August (Dearing et al. 2008).

Fig. 4

Seasonal patterns in the long-term nutrient concentration (in mg/L) trends at sites 1, 4, 6, and 9 for TN and TP. Each column represents the long-term trend in a specific month (calculated as the sum of first two terms in Eq. 1 for the month). The red line represents the long-term mean for each month

A crop rotation system has been established in the study region for more than 100 years. Spring crops, including rice, maize, and tobacco, are planted from June to September. Winter crops, including broad beans, garlic, barley, and rape, are planted from September to April of the following year (Shang et al. 2012). Therefore, recently applied fertilizer is scoured by abundant rainfall runoff (Chen et al. 2016), which finally converges into Lake Erhai. This may be the dominant factor driving the increasing trend in TN concentration from July to September. For winter crops, fertilizer is applied in October. Garlic needs more fertilizer than other crops. More specifically, the amount of nitrogen fertilizer applied to garlic is four to five times the amount it takes up (Tang et al. 2012). Therefore, although October is the tail of the rainy season, the large amount of fertilizer may induce a substantial TN load into Lake Erhai.

TP concentrations followed a seasonal pattern similar to that of TN. The TP concentration was significantly higher from July to October, and lower in other months (Fig. 4; Table S3). The seasonal variations in TP concentrations were also due to rainfall patterns in the basin. However, unlike TN, the TP concentrations remained high throughout the rainy season. Although the rainfall leads to both scouring and dilution, an additional factor helps explain this outcome. P release from sediments has become an important source of P loading in lakes (Boström et al. 1988; Caraco et al. 1989). For example, Liu et al. (2015) estimated the release flux of TP at the sediment-water interface of Lake Erhai at 114.2 t/a. Jin et al. (2006) indicated that a low DO value in the overlying water may cause anaerobic P release at the sediment-water interface of the lake. In this study, Fig. 5 shows that oxygen concentrations in Lake Erhai were higher than 5.5 mg/L, indicating aerobic conditions, but from July to October, the DO concentrations were lower than in other months. Considering that each water sample was a mixture of surface water and overlying water, anaerobic conditions may have existed at the sediment-water interface, which may explain the higher P in these months. P release may also be affected by stream flow and wind (Klein 2008). Therefore, during the rainy season, disturbances from rainfall, river discharge, and DO may have led to phosphorus releases and the high TP concentration from July to October (Pant and Reddy 2001; Kim et al. 2003).

Fig. 5

Seasonal patterns for DO. The red line represents the long-term mean for each month

The TP concentration at site 6 increased persistently between July and October, reaching its highest value in October. This was different from other sites. This may be due to the sudden increase in tourist population between October 1 and 7, the National Day Holiday of China. For example, during the National Day Holiday in 2014, Dali received approximately 450,000 tourists, an average of 60,000 tourists each day. This was three times the mean value for that year. These high numbers of tourists may produce significant levels of domestic wastewater and solid waste, resulting in a higher TP concentration in October. This result indicated that, in addition to agricultural non-point pollution, pollution caused by tourism has become an emerging pollution source in the Lake Erhai basin.

The relationship between TN and TP concentration and rainfall can highlight nitrogen and phosphorus sources (Duan and Zhang 1999). Non-point source pollution is significantly affected by rainfall; however, point source pollution is relatively stable. The seasonal TN and TP concentration patterns at site 9 were steadier than at other stations. This may be because pollutants in the southern part of the Lake Erhai basin mainly come from the Boluo River and tailwater from wastewater treatment plants (WWTPs). The nitrogen and phosphorus concentrations are simultaneously impacted by non-point and point source pollution, making the effect of rainfall less clear.

Modeling of the long-term TP trends at site 6 using an NARX-ANN approach

Long-term TN trends at all sites and long-term TP trends at most sites decreased during the study period; however, TP at site 6 showed a significant increasing trend across these years. Therefore, site 6 and its TP levels were selected as the sensitive area and sensitive water quality parameter, respectively. Afterward, a correlation analysis was conducted before the ANN analysis to select the input parameters.

We selected the input indicators by examining the correlation between TP and other water quality parameters at site 6. Trend data (decomposed by STL) and observed data were both used for the correlation analysis. In the observed data analysis, TP was not significantly correlated with other parameters. However, in the trend data analysis, TP was significantly correlated with COD, DO, pH, and water temperature. This may be because trend-based data can reflect variation tendencies more accurately than observed data. As such, a correlation analysis based on trend data can more clearly reveal the correlation between parameters (Chang et al. 2015). Although TP was correlated with COD, DO, pH, and water temperature, based on data availability, we selected the more easily accessible physical parameters as input variables: DO, pH, and water temperature (Table 2).

Table 2 Correlation of water quality parameter pairs at site 6 (January 2006–December 2014)

Using the three selected factors as input variables and the TP concentrations as the output variable, this study adopted NARX-ANN to estimate the TP concentration at site 6. The neural network was trained using 70% of the data (2006–2014), and validation and testing were each conducted using 15% of the data. Table 3 presents the model performance in the testing stages. The RMSE was sufficiently low, and the R value was sufficiently high, to demonstrate that the model performed well. Figure 6 shows the modeled time series of TP concentrations associated with NARX-ANN. The results demonstrated that (a) the model fit the time series of TP well and (b) the model captured the peak value during the rainiest season (June–September). This is critical for water pollution control. In summary, the NARX network coupled with the correlation analysis can delineate TP concentrations using DO, pH, and water temperature.

Table 3 Performance of TP concentration modeling at site 6 in the testing stages
Fig. 6

Performance comparison of regional TP concentrations

Conclusion

This study developed an integrated approach combining STL and a dynamic nonlinear autoregressive model with exogenous inputs (NARX) network to decompose and estimate the nutrient concentrations in Lake Erhai, a preliminary eutrophic lake in China. The outcome shows the positive effects of integrated remediation over the past 10 years, as well as its deficiencies. From 2006 to 2014, TN was successfully controlled, as its concentration progressively declined in most parts of Lake Erhai. In the area near the tourist attractions, TP increased continuously from 2011 to 2014, which means that the government should not only strengthen agricultural wastewater treatment but also promote ecological tourism. TP at the sensitive site was successfully modeled using more easily obtained physical parameters, which facilitates the assessment of variations in sensitive parameters at sensitive sites.