1 Introduction

Lake level fluctuations are significant for the planning, design, construction, and operation of lakeshore structures, as well as for managing freshwater lakes for water supply purposes. To regulate future lake level fluctuations, methods for modeling high or abnormal level fluctuations should be devised. Level measurements, or equally likely future realizations of them generated by a simulation model, are a straightforward means of obtaining lake management decision variables. Although comprehensive models incorporating hydrological and hydrometeorological variables, such as precipitation, runoff, temperature, and evaporation, are available, it is more economical to use models that simulate lake level fluctuations based on past level records (Şen et al. 2000).

Lakes are used for various domestic, industrial, and agricultural purposes (Vuglinskiy 2009; Shiri et al. 2016). Forecasting lake water levels is crucial for water resource planning and management, lake navigation, tidal irrigation, and agricultural drainage canal management. Lake water level is a complicated phenomenon; it is primarily influenced by natural water exchange between the lake and its watershed, and consequently reflects hydrological changes in the watershed (Altunkaynak 2007; Karimi et al. 2012). For many practical applications, a model that forecasts lake level fluctuations based on previously measured levels is required (Karimi et al. 2012).

Over the last few decades, numerous researchers have studied lake water level models because global climate change impacts the hydrological cycle, causing many lakes to dry up or flood unexpectedly. Several techniques have been devised to model lake level fluctuations. Şen et al. (2000) employed periodic and stochastic processes. Altunkaynak et al. (2003) used the diagram model and Markov process. Altunkaynak (2007) employed an artificial neural network. Altunkaynak and Şen (2007) used fuzzy logic. Kişi (2009) used a wavelet conjunction model. Karimi et al. (2012) employed gene expression programming and an adaptive neuro-fuzzy inference system (ANFIS). Sanikhani et al. (2015) also used ANFIS and gene expression programming. Young et al. (2015) used a timeseries forecasting model. Shiri et al. (2016) employed an extreme learning machine approach. Shafaei and Kisi (2016) employed wavelet-support vector regression (SVR), wavelet-ANFIS, and wavelet-autoregressive moving average conjunction models. Liang et al. (2018) used a deep learning method. Peprah et al. (2021) employed integrated moving average and Kalman filtering techniques. Luo et al. (2021) used machine learning methods.

Recently, three machine learning techniques, multivariate adaptive regression splines (MARS), least-square SVR (LSSVR), and M5 model tree (M5-tree), have emerged and shown remarkable promise in addressing difficult nonlinear problems. These techniques have been widely employed in solving hydrologic challenges (Yaseen et al. 2016; Kisi et al. 2017a, b; Demir and Çubukçu 2021). MARS is a relatively new artificial intelligence technique (Friedman 1991). Its main advantages are the ability to capture the inherent complexity of data mapping in high-dimensional data patterns, a rapid and adaptable model structure, and accurate forecasting of continuous and binary output variables. Further, this nonparametric statistical method provides a versatile procedure for organizing the relationship between input and output variables with fewer variable interactions (Leathwick et al. 2006). Previous water resources applications of MARS include rainfall and temperature forecasting, streamflow forecasting, sediment concentration estimation, water pollution forecasting, air pollutant forecasting, freshwater distribution system modeling, and river flow simulation during drought events (Leathwick et al. 2006; Sotomayor 2010; Adamowski et al. 2012; Kisi 2015a; Shortridge et al. 2015; Kisi and Parmar 2016; Yaseen et al. 2016; Kisi et al. 2017b).

LSSVR is a modified variant of SVR that replaces the quadratic programming problem of SVR with a set of linear equations (Suykens and Vandewalle 1999). It also avoids some drawbacks of other data-driven learning systems (e.g., local minima, long computation times, and overfitting) (Ji et al. 2014). LSSVR has been successfully applied in engineering, e.g., to predict wastewater effluent parameters (Huang et al. 2009), design the structural components of a wing-box for an airplane (Deng and Yeh 2010), design a superconducting magnetic energy storage controller with adaptive dampening (Pahasa and Ngamroo 2011), forecast CO2 emission in reservoir (Shokrollahi et al. 2013), analyze oil recovery (Kamari et al. 2014), and forecast reservoir oil viscosity (Hemmati-Sarapardeh et al. 2014). In hydrology, a few studies have been conducted using LSSVR, for example, streamflow forecasting and estimation (Kisi 2015b; Yaseen et al. 2016; Kisi et al. 2017b), daily water demand and daily dam inflow estimation (Hwang et al. 2012), sediment transport modeling (Kisi 2012), daily reference evapotranspiration modeling (Kisi 2013), reservoir inflow modeling (Okkan and Ali Serbes 2013), water pollution forecasting (Kisi and Parmar 2016), and air pollutant forecasting (Kisi et al. 2017a).

M5-tree is a data mining technique that uses the divide-and-conquer method to split data timeseries into subspaces, allowing a multidimensional parameter space to be divided and the model to be automatically generated on the basis of the overall quality requirement (Quinlan 1992). Scholars recently investigated the M5-tree's utility in many hydrological applications, such as water level optimization (Bhattacharya and Solomatine 2005), precipitation and river flow modeling (Solomatine and Dulal 2003), streamflow modeling (Yaseen et al. 2016), wind speed modeling (Başakın et al. 2022), air pollutant modeling (Kisi et al. 2017a), evapotranspiration modeling (Pal and Deswal 2009), pan-evaporation modeling (Kisi 2015a), flood events (Solomatine and Xue 2004), and sediment yield modeling (Goyal 2014).

A search of the Scopus database for “machine learning” and “lake level” returned 161 document results. A set of significant keywords for this study domain was created using the VOSviewer algorithm (Fig. 1a). In addition, when the retrieved studies are analyzed on a time scale (Fig. 1a), it is seen that many of them have been published in 2016 and beyond. These studies show greater research interest in data science, prediction, time series, water quality, classification, etc., and in newer machine learning models such as deep learning, support vector machine, random forest, regression tree, and extreme learning machine. Figure 1b shows the main regions where machine learning and lake level have been investigated. China has the most studies (47), followed by the USA (40), Canada (18), and Iran (17).

Fig. 1 The literature review keywords (a) and research regions (b)

In this study, the major goals are to (i) investigate three different novel heuristic regression techniques (MARS, LSSVR, and M5-tree) for water level forecasting, (ii) investigate the influence of the periodicity component (months of the observed data) on water level forecasting, and (iii) demonstrate the effectiveness of the proposed models using Lake Michigan in the USA as a case study.

2 Case study and data preparation

The name Lake Michigan comes from the Ojibwa term Michi Gami, which means “large lake”. Lake Michigan is in the USA (coordinates: 44°N 87°W); it is the third-largest of the Great Lakes, a group of five interconnected large lakes, and the sixth-largest freshwater lake globally (see Fig. 2). With a surface area of 58,016 km², a drainage area of 118,095 km², a width of 48–193 km, a length of 494 km, and a maximum depth of 0.281 km, it is the only one of the Great Lakes that lies entirely within the country’s territory (Michigan 2021a). Lake Michigan is bordered by Wisconsin to the west, Illinois and Indiana to the south, and Michigan to the east. The surface of the lake, whose waters are fresh, is 0.177 km above sea level. From its northeast corner, it is connected by the Straits of Mackinac to Lakes Superior, Huron, Erie, and Ontario (Michigan 2021b).

Fig. 2 Study area: (a) Great Lakes and (b) Lake Michigan (Michigan 2021a)

Forecasting lake level fluctuations is critical for many operations in the Lake Michigan region, including flood mitigation, reservoir management, drinking water distribution, water infrastructure management, trade, transportation, and beach erosion. The observed data for the Lake Michigan station span 103 years (1236 months), from 1918 to 2020 (IGLD 1985: Brochure on the International Great Lakes Datum 1985). The observed data were acquired from the U.S. Army Corps of Engineers website: “https://www.lre.usace.army.mil/Missions/Great-Lakes-Information/Great-Lakes-Information-2/Water-Level-Data/.” The statistical parameters of the data used during the study period are shown in Table 1. The observed lake level fluctuation data for Lake Michigan, as well as the training and test datasets, are shown in Fig. 3.

Table 1 Monthly statistical information of datasets for Lake Michigan

Fig. 3 Lake Michigan water level fluctuations and training–test datasets

The autocorrelation and partial autocorrelation functions of the lake levels for Lake Michigan are shown in Fig. 4. The figure shows that the lake level in Lake Michigan is highly correlated with the levels of previous months. The partial autocorrelation function indicates a significant correlation up to lag 8 and then stays within the confidence interval.

Fig. 4 Monthly lake level autocorrelation and partial autocorrelation coefficients for Lake Michigan
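
As an illustration only, the lag analysis described above can be sketched in Python as follows. The function names (`acf`, `pacf`, `conf_bound`) are hypothetical, and the approximate 95% bound of 1.96/√N is an assumption, since the text does not state how the confidence interval in Fig. 4 was computed. The partial autocorrelation is obtained here with the Durbin–Levinson recursion, which is one common choice and not necessarily the one used in this study.

```python
import math

def acf(series, max_lag):
    """Sample autocorrelation coefficients r_0 .. r_max_lag (biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    dev = [v - mean for v in series]
    c0 = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / c0
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelation coefficients via the Durbin-Levinson recursion."""
    r = acf(series, max_lag)
    phi_prev = []          # coefficients phi_{k-1,1} .. phi_{k-1,k-1}
    out = [1.0]            # lag 0 is 1 by convention
    for k in range(1, max_lag + 1):
        num = r[k] - sum(phi_prev[j] * r[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * r[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append phi_{k,k}.
        phi_prev = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j]
                    for j in range(k - 1)] + [phi_kk]
        out.append(phi_kk)
    return out

def conf_bound(n):
    """Approximate 95% bound for a white-noise series of length n (assumed)."""
    return 1.96 / math.sqrt(n)
```

Under these assumptions, a lag would be retained as a candidate input when the absolute value of its partial autocorrelation exceeds `conf_bound(1236)`, which is about 0.056 for the 1236 monthly records.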

3 Methods

3.1 MARS

Friedman proposed the MARS model, which is a nonparametric regression model (Friedman 1991). MARS is used to forecast nonlinear continuous numerical results. It captures complex nonlinear relationships between the input variables and the dependent variable. The MARS algorithm comprises two steps: a forward step and a backward step. In the forward step, it selects a set of suitable input variables (De Andrés et al. 2011). In the backward step, it eliminates unnecessary variables from the preselected set. The variable X is mapped to a new variable Y by one of two basis functions defined on either side of a deviation point within the input range, using the following fundamental equations (Sharda et al. 2006):

$$Y = \max (0,X - c)$$
(1)
$$Y = \max (0,c - X)$$
(2)

where c represents the threshold (lower limit) value. MARS has been used especially in financial management systems and for timeseries data (Sephton 2001; Bera et al. 2006; Yaseen et al. 2016; Kisi et al. 2017a, b; Demir and Çubukçu 2021).
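
A minimal sketch of Eqs. (1) and (2) is given below to show how the two mirrored basis functions behave around the threshold c. The function names and the coefficients in the usage note are hypothetical; a fitted MARS model would select the thresholds and coefficients in its forward and backward steps, which are not reproduced here.

```python
def hinge_pair(x, c):
    """Return the mirrored basis pair of Eqs. (1) and (2): (max(0, x - c), max(0, c - x))."""
    return max(0.0, x - c), max(0.0, c - x)

def mars_predict(x, intercept, terms):
    """Evaluate a one-variable MARS-style model (illustrative only).

    terms is a list of (coefficient, threshold, sign) tuples, where sign = +1
    selects max(0, x - c) and sign = -1 selects max(0, c - x).
    """
    total = intercept
    for coef, c, sign in terms:
        total += coef * max(0.0, sign * (x - c))
    return total
```

For example, `hinge_pair(178.0, 176.5)` gives `(1.5, 0.0)`: only the first basis function is active above the threshold, and only the second is active below it, which is how a sum of such terms produces a piecewise-linear fit.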

3.2 LSSVR

LSSVR is an extension of SVR proposed by Suykens and Vandewalle (1999). Here, it is employed to estimate water levels from the historical water level timeseries by finding the optimum function relating the input X to the output Y (Yaseen et al. 2016). It performs this operation with a nonlinear relationship function in a multidimensional feature space. The regression function can be expressed as follows:

$$y(x) = w^{T} \varphi (x) + b$$
(3)

where y is the estimated value at x, w is the coefficient vector, φ is the mapping function, and b is the bias term obtained by minimizing the upper bound of the generalization error (Suykens and Vandewalle 1999).
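
For context, the usual least-squares formulation behind Eq. (3) can be sketched as follows. The symbols γ (regularization parameter), e_i (error of the i-th training sample), α (vector of Lagrange multipliers), K (kernel matrix with entries K(x_i, x_j) = φ(x_i)^T φ(x_j)), and **1** (vector of ones) do not appear in the text above and are introduced here as assumptions about the standard notation.

$$\min_{w,b,e} \; \frac{1}{2}w^{T} w + \frac{\gamma }{2}\sum\limits_{i = 1}^{N} {e_{i}^{2} } \quad {\text{subject to}}\quad y_{i} = w^{T} \varphi (x_{i} ) + b + e_{i} ,\quad i = 1, \ldots ,N$$

Because the constraints are equalities and the errors are squared, the optimality conditions reduce to a set of linear equations rather than a quadratic program:

$$\begin{bmatrix} 0 & {\mathbf{1}}^{T} \\ {\mathbf{1}} & K + \gamma^{ - 1} I \end{bmatrix}\begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix},\qquad y(x) = \sum\limits_{i = 1}^{N} {\alpha_{i} K(x,x_{i} )} + b$$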

3.3 M5-tree

The M5-tree algorithm is a regression method developed by Quinlan (1992). Its backbone is a two-component decision tree. The method defines the relationship between the independent and dependent variables with a linear regression function applied to the final leaf nodes. The M5-tree is better suited to continuous data than other decision tree models, which are used for categorical data (Mitchell 1997).

M5-tree is a two-stage model. In the first stage, data are split into subsets to produce the decision schema (tree). The standard deviation of the class values reaching a node is used as the splitting criterion, and the expected reduction in this error from testing each attribute at that node is calculated (Solomatine and Xue 2004; Pal and Deswal 2009). The expression of the standard deviation reduction (SDR) is as follows.

$$SDR = sd(T) - \sum\limits_{i} {\frac{{\left| {T_{i} } \right|}}{\left| T \right|}} sd(T_{i} )$$
(4)

In this formula, sd is the standard deviation, T represents the set of instances reaching the node, and Ti represents the subset of instances that have the ith outcome of the potential test (Quinlan 1992).
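
Eq. (4) can be illustrated with the short Python sketch below. The function names are hypothetical, the population standard deviation is assumed for sd, and the exhaustive threshold search on a single attribute is a simplification of how a tree learner would scan candidate splits.

```python
from statistics import pstdev

def sdr(parent, subsets):
    """Standard deviation reduction of Eq. (4) for one candidate split."""
    n = len(parent)
    weighted = sum(len(s) / n * pstdev(s) for s in subsets)
    return pstdev(parent) - weighted

def best_split(x, y):
    """Scan midpoints between sorted distinct x values; return (threshold, SDR)."""
    pairs = sorted(zip(x, y))
    best = (None, float("-inf"))
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate equal attribute values
        thr = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [v for u, v in pairs if u <= thr]
        right = [v for u, v in pairs if u > thr]
        gain = sdr(y, [left, right])
        if gain > best[1]:
            best = (thr, gain)
    return best
```

A split that separates the target values into two homogeneous groups removes all of the within-node spread, so its SDR equals sd(T); a split that leaves both subsets as mixed as the parent has an SDR of zero.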

4 Results and discussion

The three heuristic regression techniques evaluated (MARS, LSSVR, and M5-tree) were implemented using MATLAB subroutines to forecast the lake levels. The data were divided into training and test datasets before modeling. The training dataset accounts for 80% of the data (1236 × 0.8 ≈ 989), whereas the test dataset accounts for 20% (247). Quantitative indicators are commonly used to evaluate hydrological applications. Legates and McCabe (1999) suggested that predictive models in the hydrology field be tested using goodness-of-fit measures, e.g., root-mean-square error (RMSE), mean absolute error (MAE), and coefficient of determination (R2), as shown in Eqs. (5)–(7), respectively.

$$RMSE = \sqrt {\frac{1}{N}\sum\limits_{i = 1}^{N} {(L_{e} - L_{o} )^{2} } }$$
(5)
$$MAE = \frac{1}{N}\sum\limits_{i = 1}^{N} {\left| {L_{e} - L_{o} } \right|}$$
(6)
$$R^{2} = \frac{{\left[ {\sum\limits_{i = 1}^{N} {(L_{e} - } \overline{L}_{e} )(L_{o} - \overline{L}_{o} )} \right]^{2} }}{{\sum\limits_{i = 1}^{N} {(L_{e} - \overline{L}_{e} )^{2} \sum\limits_{i = 1}^{N} {(L_{o} - \overline{L}_{o} )^{2} } } }}$$
(7)

In Eqs. (5)–(7), Le and Lo indicate the estimated and observed water level values, respectively, and N indicates the number of water level data points. This study aims to forecast lake level fluctuations using MARS, M5-tree, and LSSVR. In this context, different input combinations were explored to forecast the water levels. The inputs include the lake levels of the previous months (t − 1, t − 2, t − 3, t − 4, t − 5, t − 6, t − 7, t − 8), and the output corresponds to the lake level at time t.
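
For illustration, Eqs. (5)–(7) can be written in Python as below. The function names are hypothetical, and `est` and `obs` stand for the sequences of Le and Lo values.

```python
import math

def rmse(est, obs):
    """Root-mean-square error, Eq. (5)."""
    return math.sqrt(sum((e - o) ** 2 for e, o in zip(est, obs)) / len(obs))

def mae(est, obs):
    """Mean absolute error, Eq. (6)."""
    return sum(abs(e - o) for e, o in zip(est, obs)) / len(obs)

def r2(est, obs):
    """Coefficient of determination as the squared correlation, Eq. (7)."""
    n = len(obs)
    mean_e = sum(est) / n
    mean_o = sum(obs) / n
    cross = sum((e - mean_e) * (o - mean_o) for e, o in zip(est, obs))
    var_e = sum((e - mean_e) ** 2 for e in est)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cross ** 2 / (var_e * var_o)
```

Note that R2 in the form of Eq. (7) measures linear association only: estimates that are exactly twice the observations still give R2 = 1 although RMSE and MAE are nonzero, which is why the three criteria are read together.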

The results of heuristic regression techniques in terms of RMSE, MAE, and R2 are summarized in Table 2 with input combinations. RMSE ≤ 60 cm indicates an excellent and appropriate estimate (Coulibaly 2010; Sanikhani et al. 2015). For Lake Michigan, RMSE ≤ 7.4 cm is an excellent and satisfactory estimate.

Table 2 Results of heuristic regression techniques

According to the training results, the input combination (t − 1 to t − 8) had the most significant effect on forecasting the lake level at time t. M5-tree yielded the least error in the training phase, followed by LSSVR and MARS with little difference. In the test phase, the results of the input combinations for MARS and LSSVR were consistent with the autocorrelation and partial autocorrelation functions in Fig. 4. However, errors increased after the fifth combination (t − 1 to t − 5) in M5-tree. MARS yielded the least error and was closest to the best fit, followed by LSSVR and M5-tree. The timeseries and scatter plots of the best results for each method are depicted in Figs. 5 and 6.

Fig. 5 Observed and forecasted lake level timeseries and scatter plots for the training phase

Fig. 6 Observed and forecasted lake level timeseries and scatter plots for the test phase

In the second part of the modeling, the periodicity data component was examined and evaluated. The main purpose of integrating this periodic subdata, which represents the one-year cycle when forecasting one month ahead, was to provide the modeling with an external flow pattern that could yield a more comprehensive understanding and higher outcome accuracy (Yaseen et al. 2016). The outcomes of the training and test phases for the periodic heuristic regression techniques are summarized in Table 3. The addition of the periodicity component increased the average performance in all models. In particular, it improved the model test performance accuracy in terms of RMSE and MAE by 15.53–13.25% (for example, RMSE for best MARS: 0.0425 − 0.0425 × 0.1553 = 0.0359), 11.24–8.43%, and 4.98–11.08% for best MARS and best LSSVR, respectively.
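
The construction of the lagged inputs and the periodicity component can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the month is encoded here as an integer from 1 to 12 (the text states only that the months of the observed data were used), and the chronological 80/20 split is applied to the constructed samples.

```python
def make_samples(levels, n_lags, start_month=1, periodic=False):
    """Build (inputs, target) pairs from a monthly level series.

    inputs holds the levels at t-1 .. t-n_lags; when periodic is True the
    month of year of the target (1-12) is appended as an extra input.
    start_month is the calendar month of levels[0].
    """
    samples = []
    for t in range(n_lags, len(levels)):
        inputs = [levels[t - k] for k in range(1, n_lags + 1)]
        if periodic:
            inputs.append((start_month - 1 + t) % 12 + 1)
        samples.append((inputs, levels[t]))
    return samples

def chrono_split(samples, train_frac=0.8):
    """Split in time order: the first train_frac for training, the rest for testing."""
    cut = round(len(samples) * train_frac)
    return samples[:cut], samples[cut:]
```

With `n_lags=8` and `periodic=True`, each sample has nine inputs, which would correspond to input combination VIII of the periodic models under the assumptions above.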

Table 3 Results of heuristic regression techniques with periodicity data component

The timeseries plot of the best results for the methods and the scatter plots are depicted in Figs. 7 and 8.

Fig. 7 Observed and forecasted lake level timeseries and scatter plots for the periodic training phase

Fig. 8 Observed and forecasted lake level timeseries and scatter plots for the periodic test phase

With the effect of periodicity, Figs. 7 and 8 represented the observed values better than Figs. 5 and 6. The values observed during the training phase (Figs. 5 and 7) were generally captured by the models; in other words, they were generally forecasted correctly. However, although the long-term periodic fluctuations of the values observed in the test phase (Figs. 6 and 8) were well predicted, the short-term fluctuations were underestimated by all three techniques. Likewise, with the effect of periodicity, Table 3 represented the observed values better than Table 2. From Table 3, P-MARS (VIII inputs) yielded lower RMSE and MAE and higher R2 values than the others in all datasets (RMSE = 0.0359; MAE = 0.0288; R2 = 0.9922). The worst results were obtained from the single-input model (I) using M5-tree (RMSE = 0.074; MAE = 0.058; R2 = 0.967). In Fig. 8, P-MARS (VIII inputs) yielded better estimates than the others, especially in the scatter diagrams, where the coefficients a and b of the fitted line y = ax + b are closer to 1 and 0, respectively. The reason is that the MARS structure combined with the periodicity data component could accurately model the highly nonlinear lake level process (Yaseen et al. 2016).

Although the MAE, RMSE, and R2 error criteria demonstrated the accuracy of the estimated variables, these error statistics do not reveal information about the distribution of the model estimates (Citakoglu 2021). Therefore, the Taylor diagram and the Violin plot, which contain statistical analysis, were used to compare the methods (Taylor 2001; Legouhy 2021; Başakın et al. 2022). The Taylor diagram assesses the agreement of the estimated data with the observed data and allowed further comparison of the models. Taylor diagrams for the best results of the MARS, LSSVR, and M5-tree models and the P-MARS, P-LSSVR, and P-M5-tree models are presented in Fig. 9. As can be inferred from Fig. 9, the models were relatively similar to each other. However, P-MARS was distinguishable from the other approaches in the Taylor diagram, and its estimates were quite close to the observed data. Therefore, the Taylor diagram also revealed that the P-MARS approach was more successful than the other models.

Fig. 9 Taylor diagrams of MARS, LSSVR and M5-tree models for testing phase
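
The quantities summarized by a Taylor diagram (Taylor 2001) can be sketched in Python as below. The function name is hypothetical and population statistics are assumed; the plotting itself is omitted.

```python
import math

def taylor_stats(est, obs):
    """Return (sd_est, sd_obs, correlation, centered RMS difference)."""
    n = len(obs)
    mean_e = sum(est) / n
    mean_o = sum(obs) / n
    sd_e = math.sqrt(sum((e - mean_e) ** 2 for e in est) / n)
    sd_o = math.sqrt(sum((o - mean_o) ** 2 for o in obs) / n)
    corr = sum((e - mean_e) * (o - mean_o)
               for e, o in zip(est, obs)) / (n * sd_e * sd_o)
    crmsd = math.sqrt(sum(((e - mean_e) - (o - mean_o)) ** 2
                          for e, o in zip(est, obs)) / n)
    return sd_e, sd_o, corr, crmsd
```

These four values satisfy crmsd² = sd_e² + sd_o² − 2·sd_e·sd_o·corr, which is the law-of-cosines relation that lets a single point on the diagram encode all of them; a model plots closer to the observed point as its correlation approaches 1 and its standard deviation approaches that of the observations.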

The Violin plot shows the compatibility of the forecast data with the observed data with the help of statistical parameters. An additional comparison of the models was made using the Violin plot (Fig. 10). The Violin plot used in this study was modified from Legouhy (2021). The model results without the periodic component are shown in the original (before) part, and the results using the periodic component are shown in the after part. With the addition of the periodic component, the performance of the MARS and LSSVR methods increased, but the performance of the M5-tree method decreased. This can be seen from the similarity between the violin of the observed data and the violins of the methods.

Fig. 10 Violin plot of MARS, LSSVR and M5-tree models for testing phase

Finally, a statistical significance test was performed between the results of the best methods and the observed data. The Kruskal–Wallis (KW) test was used to determine whether the estimated and measured data had similar distributions (Citakoglu 2021; Görkemli et al. 2022). As seen in Table 4, the H0 hypothesis is not rejected for the lake level estimates of the methods. In other words, there is no significant difference between the means of the forecasted and observed data. The KW test was performed at the 95% confidence level, i.e., with a critical value of pcri = 0.05.

Table 4 Kruskal–Wallis test results
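
A minimal sketch of the KW test for one pair of series (forecasted versus observed) is given below. The function name is hypothetical; the sketch uses the standard tie-corrected H statistic and, because two groups give one degree of freedom, evaluates the chi-square tail probability with the complementary error function. It is not the routine used to produce Table 4.

```python
import math

def kruskal_wallis_two(a, b):
    """Tie-corrected Kruskal-Wallis H and p-value for two samples (1 degree of freedom)."""
    groups = (a, b)
    pooled = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0          # mean of ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    if tie_term == n ** 3 - n:                # all values identical
        return 0.0, 1.0
    h = 12.0 / (n * (n + 1)) * sum(rs ** 2 / len(g)
                                   for rs, g in zip(rank_sums, groups)) - 3 * (n + 1)
    h = max(h / (1.0 - tie_term / (n ** 3 - n)), 0.0)
    p = math.erfc(math.sqrt(h / 2.0))         # chi-square upper tail, 1 d.f.
    return h, p
```

A p-value above the critical value of 0.05 means that H0 (the two series come from the same distribution) is not rejected, whereas a p-value below 0.05 indicates a significant difference.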

5 Conclusion

In this study, the applicability of MARS, M5-tree, and LSSVR in forecasting lake level fluctuations was investigated. Lake level observations from Lake Michigan in the USA were used for training and testing the three models. In terms of performance indices, the results demonstrated the effectiveness of the three models in reproducing the nonlinear behavior of lake level fluctuation.

  • In the models both with and without the periodic component introduced as input data, MARS performed slightly better than LSSVR and M5-tree.

  • In general, P-MARS yielded better forecast accuracy with input combination VIII, mainly because multivariate adaptive regression could capture the complicated nonlinear relationship.

  • Modeling using a single input (I) yielded the worst result in the estimation with M5-tree.

  • The periodic component was embedded in the input datasets of the models and evaluated, and the results revealed that integrating this component was useful in offering detailed insight into the process of forecasting monthly lake levels.

Where resources are not available to operate complicated physically based models, the proposed heuristic regression techniques may be useful practical options for improved monthly lake level forecasts. In operational water level forecasting, the proposed methods could be valuable supplements to physically based models. Understanding the causes of water level fluctuations and the factors influencing them can help in lake conservation and management. It is critical to keep water levels in Lake Michigan at a healthy level for the ecosystems and marshes that surround it. Effective strategies for sustainable integrated water resources management should be implemented to preserve ecological integrity and assure the water release and storage capacity of the Great Lakes under the pressure of unpredictable climate variables.