Introduction

Rainfall is the oldest and most commonly recorded climate variable and a valuable indicator for studying climate change, water resource management, irrigation scheduling, flood prevention, and the design of hydraulic structures (Tabari and Talaee 2011; Tabari et al. 2012; Kebede et al. 2014; Nkiaka et al. 2016; Melanie and Maria 2018). In addition, the correct estimation of hydrological events such as floods and low flows relies on frequency analysis to predict the rainfall corresponding to given return periods T (quantiles) (Karlsson et al. 2016).

Many hydrological applications rely on knowledge of these events. Unfortunately, rainfall data remain limited in both time and space, which does not always yield reliable estimates (Cantat 2004). Such studies should be based on series free of missing data and heterogeneity (Bigot 2002; Faizah et al. 2016). Since no dataset is perfectly reliable and continuous, some uncertainty will always remain (Cantat 2004). But for series with gaps, how should missing values be reconstructed, and how reliable are the reconstituted series?

Missing data are a common problem in most areas of scientific research and remain a major issue in hydrology and climatology. Gaps may result from various human and material causes. These errors are critical because they affect the continuity of precipitation records and ultimately influence the results of hydrological models that use precipitation as input (Lee and Kang 2015). The problem is more widespread in developing countries than in developed ones, and particularly in Algeria, owing to various causes: (i) frequent failures of measuring equipment, (ii) the permanent closure of some rain gauge stations, and (iii) gaps at the daily or monthly scale, which in turn lead to gaps at the annual scale. The treatment of missing data is therefore an important task in designing hydrological models (Dastorani et al. 2010; Ouarda et al. 2008).

Rubin (1976) defined missing data according to three mechanisms: data are missing completely at random (MCAR) when the probability that an instance (case) has a missing value for a variable depends neither on the observed values nor on the missing data. Data are missing at random (MAR) when that probability may depend on the observed values but not on the value of the missing data itself. Data are missing not at random (MNAR) when the probability that an instance has a missing value for a variable may depend on the value of that variable (Little and Rubin 2002).

Missing data may affect the properties of statistical estimators such as means, variances, or percentages, which leads to a loss of power and seriously misleading conclusions, especially for the prediction of extreme events and quantiles (El Methni 2013). A variety of techniques have been proposed to replace missing values with statistical predictions; this process is usually called “imputation of missing data” (Little and Rubin 2002; Audigier et al. 2015).

Various techniques have been used to estimate missing data, mainly simple imputation and multiple imputation (Presti et al. 2010; Audigier et al. 2016).

The first solutions proposed by researchers to handle the problem of missing data were simple imputation methods (Audigier et al. 2015). Their drawback is that no distinction is made between observed and imputed data. In 1977, Donald Rubin proposed the idea of multiple imputation, and the first theoretical work on it was published in 1987 (Little and Rubin 1987). Since 2005, the scientific community has accepted multiple imputation (Van Buuren 2012), and the number of publications on the subject has grown rapidly. Multiple imputation methods are now numerous and differ mainly in the imputation models they use (Sattari et al. 2017). Most published articles focus on developing new imputation methods (Brock et al. 2008; Luengo et al. 2012), but few studies address the effect of imputing rainfall series on the estimated quantiles.

In this study, we compared and evaluated four variants of simple imputation based on principal component analysis (PCA): probabilistic PCA (PPCA), expectation maximization PCA (EMPCA), regularized PCA (RPCA), and singular value decomposition PCA (SVDPCA), according to four evaluation criteria: root mean square error (RMSE), mean absolute error (MAE), quadratic error (EQR), and correlation coefficient (CC). The objective is not to apply a statistical method to an incomplete table but to evaluate the properties of the four simple imputation methods based on principal component analysis. We therefore focused on the quality of the prediction of missing data and its effect on the quantiles.

Study area and data

Study area

The study area covers the whole northern extent of Algeria, approximately between 34° N and 38° N latitude and between 2° W and 8° E longitude, and spreads over 15 watersheds (Fig. 1) characterized by different climates. Northern Algeria has a Mediterranean climate, with a cold, rainy winter and a hot, dry summer. The mean annual rainfall is 436 mm in the west (Tlemcen), 648 mm in the center (Dar El Beida), 512 mm in the east (Constantine), and 1000 mm on the coast (Jijel).

Fig. 1
figure 1

The stations in the study area

Data

Annual rainfall series from 30 stations, with a record length of 69 years (1936/1937–2004/2005), were obtained from the National Meteorological Office (NMO) and the National Water Resources Agency (NWRH). This period is the maximum common period of recorded precipitation data. Information about the stations is presented in Tables 1 and 2, and the geographical locations of the stations are shown in Fig. 1.

Table 1 Ranges of variables considered in study
Table 2 Geographic characteristics of the selected rainfall stations in Northern Algeria

Methods

The data of the 30 rainfall stations over the 69-year study period were used to generate and impute gaps under the missing completely at random (MCAR) hypothesis, using the missMDA package of the free R software (Josse and Husson 2016).

The R software provides a powerful and comprehensive system for analyzing data; used in conjunction with the R Commander graphical user interface (commonly known as Rcmdr), it is also easy and intuitive to use (Suzuki and Shimodaira 2006).

Gap generation and principle of the analysis

First, gaps were generated missing completely at random (MCAR) with the prodNA algorithm of the missForest package (Stekhoven and Bühlmann 2011), at percentages of 10, 20, 30, and 40% of the observed data, hereafter called the reference data. Starting from the original datasets (without missing values), we thus introduced varying percentages of missing values (from 10 to 40%) under the MCAR assumption. These simulated missing values were imputed using the four methods; the four evaluation criteria (RMSE, MAE, EQR, and CC) were then computed, and the difference between the imputed values and the original true values was evaluated.
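The MCAR gap-generation step can be sketched as follows; `prod_na` is an illustrative Python re-implementation of what missForest's prodNA does (function and variable names are ours, and the rainfall table is a synthetic stand-in, not the study's data):

```python
import numpy as np

def prod_na(x, frac, rng=None):
    """Blank out a fraction of entries completely at random (MCAR).
    Illustrative stand-in for missForest's prodNA; names are assumptions."""
    rng = np.random.default_rng(rng)
    x = x.astype(float).copy()
    n_missing = int(round(frac * x.size))
    # draw flat indices without replacement and set them to NaN
    idx = rng.choice(x.size, size=n_missing, replace=False)
    x.flat[idx] = np.nan
    return x

# synthetic stand-in for the 69-year x 30-station annual rainfall table
rain = np.random.default_rng(0).gamma(shape=4.0, scale=150.0, size=(69, 30))
for frac in (0.10, 0.20, 0.30, 0.40):
    gapped = prod_na(rain, frac, rng=1)
    print(f"{frac:.0%} requested -> {np.isnan(gapped).mean():.0%} missing")
```

Because the deletion probability is independent of both the observed and the missing values, this procedure satisfies the MCAR mechanism defined above.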

Imputation

Four PCA simple imputation methods were selected to cover techniques widely applied in the literature and representative of various statistical strategies.

Expectation maximization PCA

EM is a general algorithmic approach for fitting latent variable models (including mixtures), popular in large part because it is typically highly scalable and easy to implement (Lin 2010).

Probabilistic PCA

PPCA combines an EM approach to PCA with a probabilistic model, based on the assumption that the latent variables as well as the noise are normally distributed. In standard PCA, data far from the training set but close to the principal subspace may have the same reconstruction error. PPCA defines a likelihood function such that the likelihood of data far from the training set is much lower, even when close to the principal subspace, which improves the estimation accuracy. PPCA tolerates about 10 to 15% missing values; beyond that, the algorithm is likely not to converge to a reasonable solution (Stacklies and Redestig 2017).

Regularized PCA

Regularized PCA is based on the regularized iterative algorithm, which yields a point estimate of the parameters and overcomes the major problem of overfitting (Josse et al. 2012).
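The core step of the regularized iterative algorithm can be illustrated as follows: instead of reconstructing the data from the raw k leading singular values, each retained singular value is shrunk toward zero in proportion to an estimated noise variance. This is a minimal sketch; the noise-variance estimator below is a simplification (assumption) relative to the estimator in Josse et al. (2012):

```python
import numpy as np

def regularized_reconstruction(x_filled, k):
    """Rank-k reconstruction with shrunk singular values, the key step of
    regularized iterative PCA. The noise-variance estimate below is a
    simplified assumption, not the published estimator."""
    u, s, vt = np.linalg.svd(x_filled, full_matrices=False)
    # noise variance estimated from the discarded trailing singular values
    sigma2 = np.mean(s[k:] ** 2) / x_filled.shape[1] if s.size > k else 0.0
    # shrink each retained singular value toward zero (never below zero)
    s_shrunk = np.maximum((s[:k] ** 2 - sigma2) / s[:k], 0.0)
    return (u[:, :k] * s_shrunk) @ vt[:k]
```

Within the iterative algorithm, this damped reconstruction replaces the missing entries at each step, which keeps the imputation from fitting noise when the signal-to-noise ratio is low.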

Singular value decomposition PCA

This implements the SVDimpute algorithm proposed by Troyanskaya et al. (2001). The idea is to estimate the missing values as a linear combination of the k most significant eigengenes. The algorithm works iteratively until the change in the estimated solution falls below a given threshold. At each step, the eigengenes of the current estimate are calculated and used to determine a new estimate. An optimal linear combination is found by regressing an incomplete variable against the k most significant eigengenes; if the value at position j is missing, the jth value of the eigengenes is not used when determining the regression coefficients. SVDimpute appears tolerant of relatively high amounts of missing data (> 10%).
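The iterative scheme described above can be sketched as follows. This is a simplified SVDimpute in Python: column means as the initial fill, then repeated rank-k SVD reconstructions of the missing cells until convergence; unlike the original algorithm, this sketch does not refit a separate regression per incomplete row:

```python
import numpy as np

def svd_impute(x, k=2, tol=1e-6, max_iter=500):
    """Simplified iterative SVD imputation: initialize gaps with column
    means, then alternate (1) rank-k SVD reconstruction and (2) restoring
    the observed entries, until the imputed values stop changing."""
    x = np.asarray(x, float)
    miss = np.isnan(x)
    filled = np.where(miss, np.nanmean(x, axis=0), x)
    for _ in range(max_iter):
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        approx = (u[:, :k] * s[:k]) @ vt[:k]   # rank-k reconstruction
        new = np.where(miss, approx, x)        # observed values are kept
        if np.sqrt(np.mean((new - filled) ** 2)) < tol:
            return new
        filled = new
    return filled
```

On data that are genuinely close to rank k, the imputed entries converge to the low-rank completion; the observed entries are never altered.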

Results and discussion

Performance of the estimation methods

In this study, the comparison was performed on the real rainfall series with 10, 20, 30, and 40% gaps. The performances of the estimation methods were compared and assessed using four performance measures: RMSE, MAE, EQR, and CC, chosen to cover techniques widely applied in the literature and representative of various statistical strategies (Boke 2017). Each error measure quantifies the difference between the estimated (predicted) values and the corresponding observed values. The four indices are given by the following expressions:

$$ RMSE={\left[\frac{1}{n}\sum \limits_{i=1}^n{\left( PanObs_i- PanPred_i\right)}^2\right]}^{0.5} $$
(1)
$$ MAE=\frac{1}{n}\sum \limits_{i=1}^n\left| PanPred_i- PanObs_i\right| $$
(2)
$$ EQR=\sum \limits_{i=1}^n{\left( PanObs_i- PanPred_i\right)}^2 $$
(3)
$$ CC=\frac{\sum_{i=1}^n\left( PanObs_i-\overline{PanObs}\right)\left( PanPred_i-\overline{PanPred}\right)}{\sqrt{\sum_{i=1}^n{\left( PanObs_i-\overline{PanObs}\right)}^2\;\sum_{i=1}^n{\left( PanPred_i-\overline{PanPred}\right)}^2}} $$
(4)

where \( PanObs_i \) is the observed precipitation, \( PanPred_i \) is the predicted value of precipitation (here, the imputed value), \( \overline{PanObs} \) and \( \overline{PanPred} \) are the means of the observed and predicted precipitation, and n is the number of values compared.
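Expressions (1)–(4) translate directly into code; the following sketch computes all four criteria for a pair of observed/imputed vectors (function and variable names are ours):

```python
import numpy as np

def evaluation_criteria(pan_obs, pan_pred):
    """RMSE, MAE, EQR (sum of squared errors), and CC, as in Eqs. (1)-(4)."""
    pan_obs = np.asarray(pan_obs, float)
    pan_pred = np.asarray(pan_pred, float)
    err = pan_obs - pan_pred
    rmse = np.sqrt(np.mean(err ** 2))           # Eq. (1)
    mae = np.mean(np.abs(err))                  # Eq. (2)
    eqr = np.sum(err ** 2)                      # Eq. (3)
    cc = np.corrcoef(pan_obs, pan_pred)[0, 1]   # Eq. (4), Pearson correlation
    return rmse, mae, eqr, cc
```

In the experiments, these criteria are computed between the imputed values and the reference values at the positions where gaps were generated.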

Table 3 and Fig. 2 present, respectively, the numerical and graphical assessment of the simple imputation methods for the various percentages of missing values, using RMSE, MAE, EQR, and CC as criteria.

Table 3 Comparison of estimation methods based on RMSE, CC, MAE, EQR, and number of principal component (NCP) used with four different percentages of missing values after imputation
Fig. 2
figure 2

Assessment of simple imputation methods for various percentages of missing values using four measures of performance criteria. a RMSE. b MAE. c EQR. d Correlation coefficient (CC)

As the percentage of missing data increases from 10 to 20%, the RMSE, EQR, and MAE values of the four methods (PPCA, EM, regularized, and SVD) tend to decrease, with a corresponding increase in the CC coefficient. From 20 to 40%, the RMSE, EQR, and MAE values tend to increase, with a corresponding decrease in CC. The regularized method is found to be the best of the four estimation methods and the EM method the second best, based on their values of the four error indices from 10 to 40% missing data. The lowest performances are given by the SVD and PPCA methods.

Influence of the imputations on the quantiles

According to the above results, the regularized variant proves to be the best for imputation; nevertheless, after filling a rainfall series, further estimates are needed to predict hydrological events using frequency analysis.

In this context, is it still the best imputation method for quantile estimation?

To answer this question, we turned to the estimation of quantiles.

To avoid performing the calculation for all 30 stations, we carried out a hierarchical classification by Ward's method based on the results of a principal component analysis (Brito et al. 2016), using the FactoMineR package of the free R software (Lê et al. 2008).
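This workflow (PCA, then Ward clustering on the component scores) can be sketched in Python; the data below are synthetic placeholders, not the study's station records:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# placeholder table: 30 stations x 12 mean monthly rainfall values
monthly = rng.gamma(shape=4.0, scale=20.0, size=(30, 12))

# PCA via SVD of the standardized table; keep the leading components
z = (monthly - monthly.mean(axis=0)) / monthly.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
pca_scores = u[:, :5] * s[:5]

# Ward hierarchical clustering on the PCA scores, cut into 4 classes
tree = linkage(pca_scores, method="ward")
classes = fcluster(tree, t=4, criterion="maxclust")
print(np.bincount(classes)[1:])  # number of stations per class
```

A natural "paragon" for each class is then the station whose score vector lies closest to its class centroid.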

The classification of the individuals (stations) into four classes uses the mean rainfall of the 12 months of the year over the 69-year period as active variables (their values are not reported here). The geographic coordinates (latitude, longitude), the altitude, and the interannual monthly totals are taken as supplementary variables (Fig. 3).

Fig. 3
figure 3

PCA circle of correlations

Each of the four classes is represented by a station called its “Paragon” (Lê et al. 2008), i.e., the individual (station) that best represents, on average, the characteristics of its class.

For this purpose, all subsequent analyses are performed only on the four synoptic stations representative of their classes.

Classification of rainfall stations

A PCA followed by classification of the rainfall stations according to altitude, latitude, and the mean rainfall of the 12 months yielded four clusters.

Clusters 1 and 4 contain 11 and 3 stations, respectively, while clusters 2 and 3 contain 8 stations each, as illustrated in Fig. 4 and Table 4.

Fig. 4
figure 4

Hierarchical cluster analysis

Table 4 Classification of rainfall stations and their paragons

Each cluster is represented by a synoptic station called its “Paragon,” and the quantiles for the four Paragon stations (Mascara, Batna, Blida, and Jijel) were estimated for return periods of 5, 10, 20, 50, 100, 500, and 1000 years, using the normal distribution, for the four PCA imputation variants.
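Under a fitted normal distribution, the quantile for return period T is the value with non-exceedance probability 1 − 1/T. A minimal sketch (the mean and standard deviation below are illustrative, not the paper's fitted parameters):

```python
from statistics import NormalDist

def return_period_quantile(mean, std, t_years):
    """Rainfall quantile for return period T under a normal distribution:
    the value exceeded on average once every T years, i.e. the (1 - 1/T)
    quantile of N(mean, std)."""
    p = 1.0 - 1.0 / t_years
    return NormalDist(mu=mean, sigma=std).inv_cdf(p)

# illustrative annual-rainfall parameters (assumed, not from the study)
for t in (5, 10, 20, 50, 100, 500, 1000):
    print(t, round(return_period_quantile(600.0, 120.0, t), 1))
```

In the study, this computation is repeated once per imputation variant and per percentage of missing values, so that the predicted quantiles can be compared against those of the complete reference series.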

Effect of filling on quantiles

The effect of filling on the observed and predicted quantiles is shown in Table 5, which gives the predicted quantiles according to return period for the four Paragon stations (Mascara, Batna, Blida, and Jijel), based on the simple imputation methods for the various percentages of missing values.

Table 5 Quantiles observed and calculated with PCA methods according to return periods for the fourth station for 10 to 40% of filling, (a) Mascara, (b) Batna, (c) Blida, and (d) Jijel

For the Mascara station, Table 5(a) shows that the EM and regularized methods, for 10, 30, and 40% missing data, give a good estimate of the predicted quantiles compared with the observed ones, with an acceptable positive or negative margin. For 20% missing data, these methods give a good estimation, with predicted quantiles matching the observed values.

For the Batna station, Table 5(b) shows that the EM and regularized methods, for 10 and 30% missing data, give a good estimation, with predicted quantiles matching the observed ones. For 20 and 40% missing data, these methods give a good estimation of the predicted quantiles compared with the observed ones, with an acceptable positive or negative margin.

For Blida station, Table 5(c) shows that EM and regularized methods for 10 to 40% of missing data give a good estimation of predicted values of quantiles compared to observed values with an acceptable positive or negative margin.

For Jijel station, Table 5(d) shows that EM and regularized methods for 10 to 40% of missing data give a good estimation of predicted quantiles compared to observed quantiles with an acceptable positive or negative margin.

Finally, for each percentage of missing data (from 10 to 40%), the regularized method is the best of the four estimation methods and the EM method the second best; the lowest performances are given by the SVD and PPCA methods, based on their values of the two performance criteria, CC and relative error (RE).

CC and RE of observed quantiles with quantiles after filling

The CC values between the observed quantiles and the quantiles after filling, for the annual rainfall series filled with the PCA variants at 10 to 40% missing data, are given in Table 6 for the Paragon stations (Mascara, Batna, Blida, and Jijel). The CC values are acceptable and vary between 0.66 and 0.97 for EM and between 0.74 and 0.97 for regularized PCA.

Table 6 Correlation coefficient of quantiles observed with quantiles after filling for the fourth station

The RE values between the observed quantiles and the quantiles after filling, for the annual rainfall series filled with the PCA variants at 10 to 40% missing data, are given in Table 7 for the Paragon stations (Mascara, Batna, Blida, and Jijel).

Table 7 Percentage (%) of relative error of quantiles observed with quantiles after filling for the fourth station for 10 to 40% of filling, (a) Mascara, (b) Batna, (c) Blida, and (d) Jijel

The RE values for the Mascara station vary between 1.7 and 3.4% for EM and between 0.20 and 3.5% for regularized PCA (Table 7(a)).

The RE values for the Batna station vary between 0.17 and 2.71% for EM and between 0.46 and 3.64% for regularized PCA (Table 7(b)).

The RE values for the Blida station vary between 1.32 and 4.63% for EM and between 3.15 and 4.74% for regularized PCA (Table 7(c)).

The RE values for the Jijel station vary between 0.59 and 3.89% for EM and between 0.91 and 4.19% for regularized PCA (Table 7(d)).

Conclusion

In the present study, four simple imputation methods (probabilistic PCA, expectation maximization PCA, regularized PCA, and singular value decomposition PCA) were compared on a real dataset from rainfall stations in Algeria under the MCAR hypothesis. Since validating the results and choosing the best imputation method is an important step, the prediction performances of the four methods were assessed using several statistical criteria: root mean square error, mean absolute error, quadratic error, and correlation coefficient. The study examined the effect of the simple imputations on the quantiles of the rainfall series of 30 stations in northern Algeria over the 69-year study period. The results of the imputations for four percentages of missing values (PMVs), namely 10, 20, 30, and 40%, suggest that regularized PCA and expectation maximization PCA are the best methods and can be used successfully to fill gaps, whereas singular value decomposition PCA and probabilistic PCA give the lowest performances (Table 3; Fig. 2). Moreover, the regularized PCA and expectation maximization PCA methods are the best at estimating quantiles compared with the observed reference values for the four Paragons determined by the cluster analysis, yielding very good to acceptable predicted quantiles in terms of CC and RE (e.g., CC = 0.97 with 10% of PMV and CC = 0.66 with 40% of PMV; RE = 4.74% with 10% of PMV and RE = 3.82% with 40% of PMV).