Abstract
This research paper investigates a cubist method as a rule-based regression predictive model for index rainfall (IR) estimation. The IR is required both in the regional frequency analysis procedure and in the evaluation of probable maximum precipitation. This IR is still considered a basic means in the rainfall-runoff transfer process. Data used include annual maximum rainfall from 75 rain gauge stations in the Cheliff watershed (Algeria). The data have geographic information and annual precipitation values. The adopted model was trained on 70% of the available data with optimized hyper-parameters using the leave one out cross-validation (LOOCV) technique. The remaining (30%) of the data were used as a testing set for evaluation. Three metrics: Correlation Coefficient (R), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE), were used to measure the prediction performance of the regression model. Finally, the results compare models with and without introducing climatic input.
Access provided by Autonomous University of Puebla. Download conference paper PDF
Similar content being viewed by others
Keywords
1 Introduction
The index rainfall (IR) is a variable defined as a central tendency (mean or median of series of values of rainfall), which poses a real problem as to its estimation in ungauged regions, using growth curves resulting from a regional frequency analysis such as L-moments for example in [1].
Also, This IR is useful when it comes to evaluating the probable maximum precipitation by methods derived from the general formulation of Chow [2,3,4,5], which are providing a valuable tool for the hydrologic design of hydraulic structures. The procedure developed by Hershfield [3, 4] and later modified by Hershfield [5] is based on the general frequency equation of Chow [2]. This technique requires a series of maximum annual daily rainfall measurements at a particular observation point as the input data [6].
The IR is still considered a basic means in the rainfall-runoff transfer process [7]. Several models (Crupedix, Sogreah, and Socose) should be listed since they are applied in Algeria; for more details refer to [8]. All these methods are a useful tool used in the hydraulic design of flood protection infrastructures and flood risk management [9, 10].
The present contribution aims to model the index rainfall (IR) in Cheliff watershed (Algeria) using: (i) exclusively the mean annual precipitation as predictor in the simple linear model of Body [11]. (ii) the combination of the geographical coordinates with the mean annual precipitation as the predictors and (iii) exclusively the geographical coordinates of the measurement stations as predictors. So, we will explore in the two later models, the predictive capabilities of the rule-based technique named Cubist model (which is an improvement of the M5Trees model).
In Algeria, this aspect of the study has not been supported by published studies excepted for the document of Body [11]. In this document, the author uses the isohyets of the mean annual rainfall provided by the maps of Chaumont and Paquin [12], which explain the spatial distribution of the IR through three regional formulations of simple linear regression. These three relationships applicable in (Algeria: east, west, and Sahara) were based on series of rainfall measurements for the period (1913–1963).
2 Material and Methods
2.1 Study Area
The hydrographic basin named Cheliff is located in the northwest of Algeria country. It is limited between geographic coordinates 0.36° and 3.36° of longitude East and 34.49° and 36.38° of latitude North (Fig. 76.1, Table 76.1). It is coded (01) among (17) other watersheds, and it covers the area of 43,750 km2. It is considered the largest basins in the country. The climate type is the semi-arid Mediterranean with warm summers and cold winters. The precipitations have a large variability with a trend of decrease from north to south and from east to west. The mean annual precipitation ranges from 153.80 to 599.50 mm (Table 76.1).
2.2 Data Description
The first analysis of the dataset revealed that there are 92 rain gauge stations, with maximum daily rainfall records in the Cheliff watershed. However, some stations only have short records. To satisfy statistical requirements, we selected the stations with at least 20 years of records. Finally, according to this principle, 75 rain gauge stations were chosen with an average length of 48 years. All the stations, numbered from 1 to 75, are located at various places throughout the study area (Cheliff basin), as shown in Fig. 76.1. All data set used in the present study are provided by National Agency for Water Resources. (NAWR); in French: Agence Nationale des Ressources Hydrauliques [7].
The relationship between the index rainfall (IR) and the predictors is shown by the correlation matrix plot (Fig. 76.2), which is the pairwise relationship between two variables with corresponding correlation coefficients for each indicator. The correlations and histograms for a data matrix are given in Fig. 76.2.
2.3 Methodology
2.3.1 Simple Linear Regression
The simple linear regression (SLR) estimates the index rainfall IR as a linear function of one predictor (X) and written: \( IR = a_{0} + bX\); where a0 and b are the linear coefficients for the SLR. The coefficients are estimated using the well-known least square method.
In Algeria, this aspect of the study has been maintained in the document of Body [11], in which the author uses the isohyets of the mean annual precipitation (MAP) from the maps of Chaumont and Paquin [12] to explain the spatial distribution of the index rainfall (IR) through three regional relations of simple linear regression. These three relationships applicable in (Algeria: East, West, and Sahara) were based on series of rainfall measurements for the period (1913–1963) [11]. The Cheliff watershed is concerned by the first relation written as:
2.3.2 Cubist Model
Rule-based models consist of one or more crossed if/then conditions for the predictors that divider the data [13]. Within these dividers, a model is used to forecast the response [13].
Cubist is a rule-based model that is an extension of Quinlan’s M5 model tree [14]. These models are based on the predictors used in previous splits. Also, there are intermediate linear models at each step of the tree. The modern version of model rules was called Cubist. There were small technical differences between Cubist and M5 rules enumerated in Kuhn and Johnson [13], “but the main improvements were: (i) an ensemble method for predictions called committees, and (ii) a nearest-neighbor adjustment that occurs after the model predictions” [13].
2.3.3 Cross Validation
Performances of both models with best parameters, were evaluated through a leave-one-out cross-validation approach (LOOCV). We used the covariates of 70% of all data to perform cross-validation and determine the best parameters. Then, the final models were built using the best parameters. Finally, the covariates of 30% of the remaining data were applied to the models to predict index rainfall.
The LOOCV is a particular case of K-fold cross-validation (K = n = length of the sample equal to 75 cases), which is one of the most commonly used methods of evaluating the predictive performances [15].
2.3.4 Accuracy Assessment
There is no single statistic that captures all aspects of interest [16, 17]. For this reason, it is useful to consider important performance statistics. In the following definitions, Oi represents the ith observed value and Mi represents the ith modelled value for a total of n observations. Om and Mm are the average values of Oi and Mi. The accuracy of the predictions was assessed using some evaluation indices such as Correlation Coefficient (R), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The expressions and the brief explanations for these metrics are given below.
2.3.4.1 Correlation Coefficient (R)
The (Pearson) correlation coefficient (R) measures the strength of the linear and the agreement between observed and predicted (modelled) samples, i.e., how close the model predictions fall along a 45-degree line from the origin to the observed data.
2.3.4.2 Root Mean Squared Error (RMSE)
The RMSE is a commonly statistic used that provides a good overall measure of how close modelled (predicted) values are to observed values.
2.3.5 Mean Absolute Error (MAE)
The MAE is the average absolute difference of the estimated (predicted) value from the reference (observed).
3 Material and Methods
In the presented study, the proposed models were: (i) SLR, (ii) Cubist. All data processing in this study was performed using R software and Excel of Microsoft.
3.1 Results and Hyperparameters Tuning
In the process of the rule-based regression of the Cubist models and of the simple linear regression, an approximate function is determined to predict the value of the output. First, the datasets are randomly separated to a training set and a test set. In this contribution, approximately 70% (54 cases) of the available data (75 cases) were randomly chosen to establish the training phase. While, 30% (21 cases) of the available data were used for testing purpose. The test set was used to evaluate the accuracy of the function and estimate the performance of the adopted models. The input parameters (longitude (Lon), latitude (Lat), altitude (Alt), and mean annual precipitation (MAP)) in the Cubist models are the variables that affect the prediction target (index rainfall (IR)). For the Simple Linear Regression, the relation (76.1) of Body [11] is evaluated to predict the same target (IR), using Excel software.
In the application of Cubist experiments, the caret package in R [18] was utilized to build a rule-based regression model. The Cubist regression model, which has two tweaking parameters that can be fine-tuned; neighbors and committees, the tuning parameters are established as follows: the optimum hyper-parameter committees is selected in the range of [1:100] incrementing one at a time and the optimum hyper-parameter neighbors are selected in the range of [0:9] incrementing one at a time. To tune the Cubist model over different values of neighbors and committees, the train function in the caret package [18] can be employed to optimize the previously mentioned parameters. A smaller value of RMSE indicates better optimization results of the Cubist models. The relationship between committees and neighbors is shown in Figs. 76.3 and 76.4. The finally determined values for the Cubist models (Figs. 76.3 and 76.4) are: committees = 1 and 89; neighbors = 5 and 9 in the models with and without introducing MAP as predictor, respectively.
The tuned hyperparameters, the R packages, the functions and the ranges of parameters for both models are summarized in the Table 76.2.
3.2 Results of the Test Set
To confirm the predictive models based on the predicted and observed values, 21 testing cases were validated by the Simple Linear Regression (SLR) and two Cubist models. Figures 76.5, 76.6 and 76.7 and Table 76.3 illustrate the results. Figures 76.5, 76.6 and 76.7 show the observed versus predicted index rainfall by SLR and the Cubist models using test data. In Table 76.3, The R, RMSE and MAE of the SLR for 21 sets of testing data are 0.820, 6.045 mm and 4.582 mm, respectively. The R, RMSE and MAE of the first Cubist model (with introducing MAP variable as predictor) for 21 sets of testing data are 0.842, 4.860 mm and 3.934 mm, respectively.
While, the R, RMSE and MAE of the second Cubist model (without introducing MAP variable as predictor) for 21 sets of testing data are 0.760, 5.643 mm and 3.709 mm, respectively.
The performance has been assessed by the SLR model and the two of Cubist modeling approaches. In case of the gauged regions and the availability of precipitation data like annual values, the SLR and the first Cubist model are required to predict the index rainfall. Nevertheless, the second Cubist model is recommended in case of the ungauged regions, without losing much performance.
4 Conclusion
This work surveys the ability of the simple linear regression and the Cubist rule-based regression techniques for estimating the index rainfall. It demonstrates the relevance of the idea, which investigates the use of geographical information of rain gauge stations as the predictors. The study concluded that the SLR approach is more suitable for predicting the index rainfall in gauged regions. However, the Cubist model can be an acceptable alternative way using only the geographic predictors to predict the index rainfall in ungauged regions.
References
Hosking JRM, Wallis JR (1997) Regional frequency analysis: an approach based on L-moments. Cambridge University Press, Cambridge
Chow VT (1951) A general formula for hydrologic frequency analysis. Trans Am Geophys Union 32(2):231–237
Hershfield DM (1961a) Rainfall frequency atlas of the United States. Technical paper no. 40, Weather Bureau, United States Department of Commerce, Washington, DC
Hershfield DM (1961) Estimating the probable maximum precipitation. J Hydraulics Div Proc Am Soc Civil Eng 87:99–106
Hershfield DM (1965) Method for estimating probable maximum precipitation. J Am Water Works Assoc 57:965–972
Sammen S, Mohamed T, Ghazali A, Azlan Abdul Aziz LS (2018) Estimation of probable maximum precipitation for tropical catchment. MATEC Web conference, vol 162, p 03012. https://doi.org/10.1051/matecconf/201816203012
ANRH (2008) Etude générale des crues du Nord de l'Algérie, Modélisation des débits de crue, Juillet 2008. Algeria, 73 p
Laborde JP (1998) Eléments d'hydrologie de surface. Cours polycopié de l'Université de Nice-Sophia Antipolis. Nice, France, 215 pages
Rezak S (2014) Hydrologie Algérienne : Synthèse des apports de crues sur SIG. Dissertation, University of Oran, Algeria
Houichi L (2017) Appropriate formula for estimating rainfall intensity of selected duration and frequency: a case study. Larhyss J 30:67–87
Body K (1981) Analyse fréquentielle des pluies de l’Algérie—Synthèse régionale : Détermination des paramètres principaux par station et leur répartition spatiale. INRH Constantine. Algeria.
Chaumont M, Paquin C (1971) Notice explicative de la carte pluviométrique de l’Algérie au 1/500.000. Société d’histoire naturelle de l’Afrique du Nord, Algeria. https://doi.org/10.1029/TR032i002p00231
Kuhn M, Johnson K (2013) Applied predictive modeling. ISBN: 978-1-4614-6848-6. https://doi.org/10.1007/978-1-4614-6849-3
Quinlan R (1993) Combining instance-based and model-based learning. In: Proceedings of the tenth international conference on machine learning, pp 236–243
Zhang Y, Yang Y (2015) Cross-validation for selecting a model selection procedure. J Econom 187(1):95–112. https://doi.org/10.1016/j.jeconom.2015.02.006
Carslaw DC, Ropkins K (2012) Open air- an R package for air quality data analysis. Environ Model Softw 27–28:52–61. https://doi.org/10.1016/j.envsoft.2011.09.008
Carslaw DC (2015) The open air manual-open-source tools for analysing air pollution data. Manual for version 1.1–4. King’s College London. http://www.openair-project.org
Kuhn M (2008) Building predictive models in R using the caret package. J Stat Softw 28(5):1–26. https://doi.org/10.18637/jss.v028.i05
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Tarfaya, C., Houichi, L. (2022). Prediction of Index Rainfall Using a Cubist Model: A Case Study of Cheliff Watershed (Algeria). In: Gourbesville, P., Caignaert, G. (eds) Advances in Hydroinformatics. Springer Water. Springer, Singapore. https://doi.org/10.1007/978-981-19-1600-7_76
Download citation
DOI: https://doi.org/10.1007/978-981-19-1600-7_76
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-1599-4
Online ISBN: 978-981-19-1600-7
eBook Packages: Earth and Environmental ScienceEarth and Environmental Science (R0)