1 Introduction

Rainfall is one of the most important hydrological parameters in hydrological and climatological studies (Kamaruzaman et al. 2017; Lee and Kang 2015). However, missing data are a critical and unavoidable problem in many fields of research. Missing data may be caused by human error in managing the datasets, equipment failure, and natural disasters that damage the gauging equipment on site. The direct impact of missing data is a lack of input data or samples for performing simulations. Consistent and complete rainfall datasets are required for accurate hydrological simulation and prediction studies (Jajarmizadeh et al. 2015; Sattari et al. 2016). Thus, missing rainfall data need to be handled carefully to enhance the reliability of hydrological studies.

To address missing rainfall observations, listwise deletion, pairwise deletion, zero imputation, and hot deck imputation are commonly adopted (Kamaruzaman et al. 2017; Pagano et al. 2014). However, these methods have yet to be proven reliable, accurate and scientifically accepted. Listwise deletion and pairwise deletion eliminate the missing observations, so any data deletion method causes a loss of information and a reduction in sample size. Zero imputation substitutes missing observations with zeros. Replacing the missing entries with zeros disrupts the data distribution and may therefore create bias and error in further studies. Zero imputation may be applicable to some hydrological parameters such as rainfall, but it may not be suitable for groundwater level or for studies with negative values (Gill et al. 2007). The hot deck imputation method is more reliable than listwise deletion, pairwise deletion, and zero imputation. It is currently used in Malaysia to replace missing observations with available observations from other nearby gauging equipment or rainfall stations (Malek et al. 2010). However, this method is not reliable if observations are missing simultaneously at the other gauging equipment and nearby rainfall stations. All these methods may create biases and result in unreliable and inaccurate studies.

Research on the impact of missing data encourages data imputation during data pre-processing to boost the performance of prediction studies. Ekeu-wei (2018) performed an experiment to estimate floods using datasets with missing and imputed observations. The imputed observations were predicted using the Monte Carlo multiple imputation approach. The results show that using imputed observations consistently improves the accuracy of flood estimates. The findings also suggest that using datasets with missing observations causes underestimation and overestimation of flood estimates. Kuok and Bessaih (2007) used an artificial neural network (ANN) to predict the daily rainfall runoff of Sungai Bedup Basin. The results indicate that ANNs performed better with an increased supply of input data. These findings highlight the importance of maintaining consistent and long-term climatological and hydrological data. The literature also emphasises data imputation as a way to increase data availability, which in turn can boost the performance and accuracy of simulation and prediction studies.

Statistical, data mining and machine learning approaches, such as ANN and K-nearest neighbour (KNN), are some of the approaches that can be used to perform data imputation. Oba et al. (2003) created Bayesian Principal Component Analysis (BPCA) to address missing values in gene expression profile data. The BPCA model outperformed KNN imputation and the singular value decomposition (SVD) method in imputing the missing data. Bennett et al. (2007) used nearest neighbour by distance (ND) and by correlation (NC), inverse distance weighted (IDW), average of gauges selected by correlation (A), and weighted average of gauges selected by correlation (WA) to impute missing rainfall data. The results showed that the WA method outperformed all the other proposed methods.

ANNs are also applied in the hydrological field for performing data prediction tasks. Luk et al. (2001) used multilayer feedforward network (MLFN), partial recurrent neural network (PRNN), and time delay neural network (TDNN) to predict the rainfall values. Kuok and Bessaih (2007) used multilayer perceptron (MLP) and recurrent (REC) network to estimate the daily rainfall runoff of Sungai Bedup Basin. Particle Swarm Optimisation Feedforward Neural Network (PSONN) was also created by Kuok et al. (2010) to calibrate the water tank model and the relationship of the rainfall runoff model at Sungai Bedup Basin. Chai et al. (2017) estimated the rainfall data by using six daily meteorology data and two types of neural networks: Backpropagation Neural Network (BPN) and Radial Basis Function Network (RBFN).

In this study, the potential of the BPCA imputation model for treating missing rainfall data is investigated. The accuracy and robustness of the BPCA imputation model are expected to be the major challenge. Accurate prediction of hydrological data such as rainfall is challenging because of their high degree of temporal and spatial variability and their non-linear characteristics (Bennett et al. 2007; Chai et al. 2017; Gill et al. 2007). Furthermore, different climate zones have different rainfall patterns and spatial distributions. This makes reconstructing missing rainfall data more challenging because the best imputation method differs between climate zones (De Silva et al. 2007). To the knowledge of the authors, there is no published work that applies the BPCA imputation model to missing rainfall data. The BPCA imputation model is known to have good imputation performance in the medical domain, which motivates studying the application of the BPCA algorithm to patching missing rainfall data. Considering these issues and challenges, there is a need to develop a novel approach to apply and boost the imputation performance of the BPCA model in treating missing rainfall data. As such, the objectives of this study are as follows:

  • To predict the missing rainfall data using BPCA model and rainfall data

  • To study the parameters that will affect the performance of BPCA model

  • To evaluate the performance of BPCA model with the introduction of reference rainfall data from neighbouring rainfall station

  • To compare the performance of BPCA model with an existing imputation model, KNN, within the study area

2 Study Area and Rainfall Stations

Kuching City in Sarawak, Malaysia, was chosen as the study area. Rainfall data within the Sarawak River Basin were adopted. The distance between rainfall stations was used as the criterion for selecting neighbouring stations. Rainfall data from more distant stations are expected to differ substantially in spatial and temporal distribution, which may lower the imputation performance. Hence, only rainfall data from neighbouring rainfall stations were selected.

The rainfall stations at Kuching Saberkas (1), Kuching Third Mile (2), Ulu Maong (3), and Kuching Airport (4) were selected in this study. The locations of the selected rainfall stations are illustrated in Fig. 1. They are relatively close to one another compared with other available stations in Kuching. The daily rainfall data for the year 1991 were collected from the Department of Irrigation and Drainage (DID) Sarawak. The rainfall data from the four stations were analysed to study the impact of the distance and data correlation between neighbouring rainfall stations on the imputation performance of the BPCA model. The correlation coefficient (r) of the rainfall data between stations was calculated using Eq. (1).

$$ r=\frac{\sum \left(A-\overline{A}\right)\left(B-\overline{B}\right)}{\sqrt{\sum {\left(A-\overline{A}\right)}^2\sum {\left(B-\overline{B}\right)}^2}} $$
(1)

where,

A: data from Station A

B: data from Station B

\( \overline{\mathrm{A}} \): mean of the data from Station A

\( \overline{\mathrm{B}} \): mean of the data from Station B
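
As an illustration, Eq. (1) can be evaluated directly with NumPy. The sketch below is not taken from the original study: the function name `pearson_r` and the sample rainfall values are hypothetical and serve only to show how each term of the equation maps to code.

```python
import numpy as np


def pearson_r(a, b):
    """Correlation coefficient r of Eq. (1) for two equal-length series."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # A minus its mean
    db = b - b.mean()  # B minus its mean
    return float((da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum()))


# Hypothetical daily rainfall (mm) at two stations, for illustration only
station_a = [0.0, 12.5, 3.0, 0.0, 40.2, 7.5]
station_b = [0.5, 10.0, 2.5, 0.0, 35.0, 9.0]
r = pearson_r(station_a, station_b)
```

The value returned should agree with `np.corrcoef(station_a, station_b)[0, 1]`, which computes the same quantity.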

Fig. 1 Selected rainfall stations

2.1 Data Correlation between the Selected Rainfall Stations

Correlations of r ≥ 0.7, 0.4 ≤ r < 0.7, and r < 0.4 are defined as high, medium, and low, respectively. The correlation coefficients between the stations are tabulated in Table 1. Table 1 shows that the rainfall datasets are highly correlated because all r values fall within the range of 0.75–0.97. It is observed that r decreases as the distance between the rainfall stations increases. Station pairs that include Kuching Airport Station have lower r than pairs that do not. This may be because the rainfall received at the other three stations differs only slightly, whereas the collected data show that the rainfall received at Kuching Airport Station is significantly lower than at the other stations. Another reason may be the geographical location of the rainfall stations. Ulu Maong Station is closer to Kuching Airport Station than Kuching Saberkas Station and Kuching Third Mile Station are. Thus, the correlation of Kuching Airport Station with Ulu Maong Station is higher than with the other two stations.
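
The three correlation classes defined above can be expressed as a small helper. This is only a sketch; the function name `correlation_class` is hypothetical and does not appear in the original study.

```python
def correlation_class(r):
    """Classify a correlation coefficient using the thresholds stated in the text."""
    if r >= 0.7:
        return "high"
    if r >= 0.4:
        return "medium"
    return "low"


# The reported range of r (0.75 to 0.97) falls entirely in the "high" class
labels = [correlation_class(r) for r in (0.75, 0.97)]
```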

Table 1 Calculated correlation between the rainfall stations

3 Imputation Models

3.1 Bayesian Principal Component Analysis (BPCA)

The BPCA imputation model created by Oba et al. (2003) represents the whole dataset of gene expression profiles as a matrix, Y, of size (D × N), where N and D are the number of genes and the number of samples, respectively. The prediction of missing values is based on three elementary processes: principal component (PC) regression, Bayesian estimation, and an expectation-maximization (EM)-like repetitive algorithm. The first two steps, PC regression and Bayesian estimation, are used to derive and set up appropriate parameters. The missing values are estimated only in the EM-like repetitive algorithm, as represented by Eq. (2). The details of the derivations and assumptions are outlined by Oba et al. (2003) and Oba (2013).

$$ {\widehat{Y}}^{miss}=\int {Y}^{miss}q\left({Y}^{miss}\right)\ d{Y}^{miss} $$
(2)

where,

\( {\widehat{Y}}^{miss} \): imputed missing variables of matrix Y

\( {Y}^{miss} \): missing variables of matrix Y

\( q\left({Y}^{miss}\right) \): posterior distribution of missing value
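
Eq. (2) is the expectation of the missing entries under q, that is, the posterior mean. As a worked special case, assume for illustration that q is Gaussian with mean \( \mu \) and covariance \( \Sigma \) (these symbols are introduced here and are not part of the original notation). The integral then reduces to that mean:

$$ q\left({Y}^{miss}\right)=\mathcal{N}\left({Y}^{miss}\mid \mu, \Sigma \right)\kern1em \Rightarrow \kern1em {\widehat{Y}}^{miss}=\int {Y}^{miss}\,\mathcal{N}\left({Y}^{miss}\mid \mu, \Sigma \right)\ d{Y}^{miss}=\mu $$

Under this assumption, the imputed value is simply the centre of the posterior, and the posterior covariance does not enter the point estimate.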

The BPCA imputation model has been applied widely in the biomedical field for patching missing microarray data. Shi et al. (2013) proposed a hybrid imputation method that utilised both BPCA imputation and Local Least Square (LLS) imputation, named the Bayesian Principal Component Analysis and Iterative Local Least Square method (BPCA-iLLS). The BPCA-iLLS model outperformed the BPCA model and the LLS model. However, the performance of the BPCA and LLS models varied significantly when different datasets were used. Their study also showed that LLS tends to outperform BPCA when dominant local similarity exists within the dataset, whereas BPCA works better when the datasets have lower complexity. A similar study was carried out by Severson et al. (2017), in which several Principal Component Analysis (PCA) based methods were introduced and evaluated for imputing missing microarray data. The methods were mean imputation, alternating least squares (ALS), singular value decomposition (SVDImpute), probabilistic principal component analysis (PPCA), PCA-data augmentation (PCADA), PPCA-M (another variation of PPCA), BPCA, singular value thresholding (SVT), another variation of alternating least squares (Alternating), and the Lagrange multiplier method (ALM). The SVDImpute and the probabilistic methods (PPCA, PPCA-M, and BPCA) were reported to perform best overall. However, it was suggested that the suitability of an imputation method may vary, with the missingness mechanism being the main factor affecting it.

The application of BPCA is not limited to the biomedical field. It has also been used to impute missing data in a total electron content (TEC) ionospheric satellite dataset. In the work of Subashini and Krishnaveni (2011), the BPCA model performed better than KNN imputation for imputing the missing TEC data. Besides imputation, BPCA has also been applied to speech feature analysis. Oh-Wook et al. (2003) proposed variational BPCA to estimate the speech feature dimensionality and the number of clusters used in a Gaussian mixture model. The literature implies that it is possible to implement BPCA in other fields of research. Hence, the BPCA imputation model is introduced in this paper to impute missing rainfall data.

3.2 K-Nearest Neighbour (KNN)

Lee and Kang (2015) patched missing rainfall data using KNN regression with five different kernel estimation functions (Epanechnikov, Quartic, Triweight, Tricube and Cosine). The imputed rainfall datasets were then used to simulate water runoff using the Soil and Water Assessment Tool (SWAT). The study showed that KNN can be applied to patch missing hydrological data. It also showed that using different kernel functions can improve the performance of KNN imputation in predicting missing rainfall data, which in turn enhances the accuracy of streamflow simulations.

As such, the KNN imputation method is introduced in this paper as a benchmark for BPCA. The purpose of comparing BPCA with KNN is to observe the reliability and robustness of BPCA. KNN has been proven to be reliable for missing data imputation in both the biomedical and hydrological fields. The built-in KNN imputation function in MATLAB, “knnimpute”, was adopted in this study. The function imputes the missing data by referring to the values from the nearest-neighbour column with no missing values. The nearest-neighbour column is determined by the Euclidean distance, as shown in Eq. (3).

$$ Euclidean\kern0.17em distance=\sqrt{\sum \limits_{i=1}^n{\left({q}_i-{p}_i\right)}^2} $$
(3)

where p and q are the vectors of two different datasets.
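
The nearest-neighbour column idea described above can be sketched in Python. This is a simplified illustration under stated assumptions, not a reimplementation of MATLAB's `knnimpute`: it uses a single neighbour (K = 1), measures the distance of Eq. (3) only over the rows observed in the incomplete column, and the function names `euclidean_distance` and `nearest_column_impute` as well as the sample matrix are hypothetical.

```python
import numpy as np


def euclidean_distance(p, q):
    """Euclidean distance of Eq. (3) between two equal-length vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sqrt(((q - p) ** 2).sum()))


def nearest_column_impute(data):
    """Fill NaNs in each column from its nearest complete column (K = 1)."""
    data = np.asarray(data, dtype=float)
    filled = data.copy()
    # Candidate neighbours are the columns without any missing value
    complete = [j for j in range(data.shape[1]) if not np.isnan(data[:, j]).any()]
    if not complete:
        raise ValueError("at least one column without missing values is required")
    for j in range(data.shape[1]):
        missing = np.isnan(data[:, j])
        if not missing.any():
            continue
        observed = ~missing
        # Compare column j with each complete column over the observed rows only
        dists = [euclidean_distance(data[observed, j], data[observed, c])
                 for c in complete]
        nearest = complete[int(np.argmin(dists))]
        filled[missing, j] = data[missing, nearest]
    return filled


# Hypothetical 3 x 3 rainfall matrix with one missing entry in column 0
rain = np.array([[1.0, 1.2, 9.0],
                 [np.nan, 4.1, 0.5],
                 [3.0, 3.3, 7.0]])
filled = nearest_column_impute(rain)
```

In this example column 1 is closer to column 0 than column 2 is, so the missing entry is copied from column 1.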

4 Methodology

The missingness mechanism in this study was assumed to be Missing Completely at Random (MCAR). Malek (2008) stated that the cause of missing rainfall data in Malaysia is mainly due to errors and mistakes in data management, human resources, instrumentation, operation and maintenance. Hence, the missing rainfall data are not caused by the occurrence of random events. To evaluate the ability of the imputation models, the general experimental procedure is outlined below:

  • Step 1: Collection of daily rainfall data from DID Sarawak

  • Step 2: Creation of six different input datasets without any missing values

  • Step 3: Introduction of artificial missing entries for all the datasets (1%, 5%, 10%, 15%, 20%, 25% and 30% of missing rainfall data entries)

  • Step 4: Import of the rainfall data and source code into MATLAB

  • Step 5: Execution of the imputation under different parameters settings (different K values and percentage of missing data entries)

  • Step 6: Evaluation on the performance of BPCA model and KNN model using different evaluation methods
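
Step 3 of the procedure above can be sketched as follows. The study itself was carried out in MATLAB; this Python version is an illustrative assumption, and the function name `introduce_mcar`, the seed, and the placeholder matrix are hypothetical. Entries are removed uniformly at random, which corresponds to the MCAR assumption, and the returned mask records which entries were removed so that they can later be compared with the imputed values.

```python
import numpy as np


def introduce_mcar(data, fraction, seed=0):
    """Replace a given fraction of entries with NaN, chosen uniformly at random."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    n_missing = int(round(fraction * data.size))
    # Sample flat indices without replacement so exactly n_missing entries are removed
    idx = rng.choice(data.size, size=n_missing, replace=False)
    masked = data.copy()
    masked.flat[idx] = np.nan
    mask = np.zeros(data.shape, dtype=bool)
    mask.flat[idx] = True
    return masked, mask


# Placeholder 10 x 10 matrix standing in for a complete rainfall dataset
complete = np.arange(100, dtype=float).reshape(10, 10)
levels = (0.01, 0.05, 0.10, 0.15, 0.20, 0.25, 0.30)
counts = {level: int(introduce_mcar(complete, level)[1].sum()) for level in levels}
```

In the study the artificial gaps were introduced into the target station's data; the sketch masks whichever array it is given, so the caller would pass only that station's block.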

The selected rainfall data were arranged into six different input datasets. The datasets were created by combining the daily rainfall data of different neighbouring stations in a matrix of form (X × Y), where X and Y represent the daily rainfall amount and the months, respectively. The motivation for creating different input datasets is to observe and evaluate the performance of the imputation models as data availability increases. The relationship between data correlation and imputation performance can also be observed by using different input datasets. The input datasets were created with Kuching Third Mile rainfall station as the imputation and evaluation target. Missing data entries of 1%, 5%, 10%, 15%, 20%, 25% and 30% were artificially introduced into the rainfall data of Kuching Third Mile. The dataset combinations are listed below:

  1. Kuching Third Mile

  2. Kuching Third Mile & Kuching Saberkas

  3. Kuching Third Mile & Ulu Maong

  4. Kuching Third Mile & Kuching Airport

  5. Kuching Third Mile, Kuching Saberkas & Ulu Maong

  6. Kuching Third Mile, Kuching Saberkas, Ulu Maong & Kuching Airport

Similar to KNN imputation, BPCA also uses K as the selection parameter. The maximum adoptable K value depends on the nature of the algorithm. The parameter K refers to the number of training samples that are referenced when performing the imputation. The performance of the imputation models was evaluated using Bias (Bs), Root Mean Square Error (RMSE) and Efficiency (E), given in Eqs. (4) to (6). A perfect estimation of the missing observations results in Bs = 1, RMSE = 0 and E = 1. The evaluation methods were selected based on relevant hydrological prediction studies by Wang et al. (2016) (for Bs and RMSE) and Malek (2008) (for E). These evaluation methods account for the drastic and rapid changes in the behaviour of the convective precipitation field.

$$ Bias,{B}_s=\frac{\sum_{i=1}^N{F}_i}{\sum_{i=1}^N{O}_i} $$
(4)
$$ Root\kern0.17em mean\kern0.17em square\kern0.17em error, RMSE=\sqrt{\frac{\sum_{i=1}^N\ {\left({O}_i-{F}_i\right)}^2}{N}} $$
(5)
$$ Efficiency,E=\frac{\sum {\left(O-\overline{O}\right)}^2-\sum {\left(O-F\right)}^2}{\sum {\left(O-\overline{O}\right)}^2} $$
(6)

where,

F: imputed value or predicted value

O: original value or observed value

\( \overline{\mathrm{O}} \): mean of original value or observed value

\( \overline{\mathrm{F}} \): mean of imputed value or predicted value

N: number of data
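
The three evaluation criteria in Eqs. (4) to (6) can be sketched in Python as follows. The function names and the sample values are hypothetical and are not taken from the original study; the code only mirrors the equations term by term.

```python
import numpy as np


def bias(observed, imputed):
    """Bs of Eq. (4): sum of imputed values divided by sum of observed values."""
    o = np.asarray(observed, dtype=float)
    f = np.asarray(imputed, dtype=float)
    return float(f.sum() / o.sum())


def rmse(observed, imputed):
    """RMSE of Eq. (5): root of the mean squared difference."""
    o = np.asarray(observed, dtype=float)
    f = np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean((o - f) ** 2)))


def efficiency(observed, imputed):
    """E of Eq. (6): share of the observed variation explained by the imputation."""
    o = np.asarray(observed, dtype=float)
    f = np.asarray(imputed, dtype=float)
    total = ((o - o.mean()) ** 2).sum()
    return float((total - ((o - f) ** 2).sum()) / total)


# Hypothetical observed and imputed rainfall values (mm)
obs = [2.0, 4.0, 6.0]
imp = [3.0, 4.0, 5.0]
scores = (bias(obs, imp), rmse(obs, imp), efficiency(obs, imp))
```

Note that Bs is a ratio of totals, so over- and underestimates can cancel: in this example Bs equals 1 even though two of the three values are wrong, which is why RMSE and E are reported alongside it.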

5 Results and Discussion

The summary of the evaluation against the imputation models is tabulated in Table 2. To ease the difficulty of comparing the imputation performance for all the data combinations, Table 2 only tabulates the best imputation performance achieved by both of the imputation models at different experimental settings. Figures 2, 3, 4, 5, 6, and 7 illustrate the imputation performance of BPCA and KNN models at different K values and percentage of missing entries. Other graphs are not included in this paper as they show similar result patterns.

Table 2 Evaluation summary for BPCA and KNN imputation

Fig. 2 Graph of Bs vs K (BPCA - Kuching Third Mile, Kuching Saberkas & Ulu Maong)

Fig. 3 Graph of RMSE vs K (BPCA - Kuching Third Mile, Kuching Saberkas & Ulu Maong)

Fig. 4 Graph of E vs K (BPCA - Kuching Third Mile, Kuching Saberkas & Ulu Maong)

Fig. 5 Graph of Bs vs K (KNN - Kuching Third Mile, Kuching Saberkas & Ulu Maong)

Fig. 6 Graph of RMSE vs K (KNN - Kuching Third Mile, Kuching Saberkas & Ulu Maong)

Fig. 7 Graph of E vs K (KNN - Kuching Third Mile, Kuching Saberkas & Ulu Maong)

The performance of both KNN and BPCA fluctuates as the K value increases. However, the range of K values used by BPCA differs from that of KNN. For BPCA, the maximum K is defined to be equal to the number of columns in the dataset (Oba 2013). Hence, the range of K values adopted for BPCA differs for each data combination. When only one rainfall dataset is used, the adoptable K values range from 1 to 12. The maximum adoptable K value increases by 12 each time the rainfall data of an additional station are added. Figures 2, 3, and 4 show adopted K values within the range of 1 to 36 because three rainfall datasets were used. Only a small fluctuation in performance is expected because the range of K values in this study is relatively small (at most 1 ≤ K ≤ 48). In contrast, the experiment conducted by Oba et al. (2003) showed a large performance difference, as its K values ranged from 1 to 200. Figures 2, 3, and 4 also show that the best, and similar, imputation performance can be achieved at different K values. This differs from the results of Oba et al. (2003), which showed that in gene profile data imputation the BPCA model performed best at K = D - 1. The difference might be due to the nature of the algorithm or of the rainfall data. Rainfall data are much more random and follow a non-linear pattern. It is also observed that the rainfall stations experienced different rainfall occurrences, amounts and patterns over the same period. These differences did not appear to be caused by any significant factor.
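
The K ranges quoted above follow from simple arithmetic: each station contributes 12 columns, so the upper limit grows by 12 per station. The sketch below reproduces those ranges for the six data combinations; the names `bpca_k_range` and `combinations` are hypothetical and used only for illustration.

```python
MONTH_COLUMNS_PER_STATION = 12  # each station contributes 12 columns


def bpca_k_range(n_stations):
    """Adoptable BPCA K range when the upper limit equals the column count."""
    return 1, MONTH_COLUMNS_PER_STATION * n_stations


# The six data combinations used in the study, keyed by dataset number
combinations = {
    1: ["Kuching Third Mile"],
    2: ["Kuching Third Mile", "Kuching Saberkas"],
    3: ["Kuching Third Mile", "Ulu Maong"],
    4: ["Kuching Third Mile", "Kuching Airport"],
    5: ["Kuching Third Mile", "Kuching Saberkas", "Ulu Maong"],
    6: ["Kuching Third Mile", "Kuching Saberkas", "Ulu Maong", "Kuching Airport"],
}
k_ranges = {key: bpca_k_range(len(stations)) for key, stations in combinations.items()}
```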

For KNN, the maximum K value is not defined. Users must therefore identify a convergence point at which to stop increasing K. To address this, the KNN model was tested within the range of 1 ≤ K ≤ 50. The performance of the KNN model for all data combinations remains unchanged once K exceeds 40 (Figs. 5, 6, and 7). This suggests that K in this study should fall within the range of 1 ≤ K ≤ 40, as any further increase in K is redundant. As with BPCA, the same imputation performance is achieved at different values of K, likely for the same reason as explained above.

Generally, the BPCA and KNN models perform similarly across all the evaluations in this study, as the majority of the results in Table 2 are close to each other at each percentage of missing entries. As expected, the imputation performance worsens as the percentage of missing entries increases. The Bs values in Table 2 are not far from 1, which means that slight overestimation (Bs > 1) and underestimation (Bs < 1) occur for both KNN and BPCA. The accuracy of both models improves when more data are provided for the imputation. The imputation performance is poorest when only the Kuching Third Mile data are used. The combination of “Kuching Third Mile, Kuching Saberkas & Ulu Maong” performed best for both the KNN and BPCA models. Performance dropped when the rainfall data from Kuching Airport Station were added, which suggests that further addition of data can be redundant. Among the two-station combinations, “Kuching Third Mile & Kuching Saberkas” outperforms the rest for both imputation models. This might be because Kuching Saberkas Station is the nearest to Kuching Third Mile Station. The correlation between the rainfall data of these two stations is also the highest. A higher correlation between datasets means that similar rainfall patterns are more likely, which results in better imputation performance. This effect is also evident in the drop in imputation performance when rainfall data from stations located further from Kuching Third Mile are used.

The relative superiority of KNN and BPCA varies with the percentage of missing entries. Table 2 shows that 90% of the KNN results at 1–20% missing entries are better than those of BPCA. On the other hand, 66% of the BPCA results at 25–30% missing entries are better than those of KNN. This means that BPCA is generally superior to KNN only at 25% missing entries and above. In terms of convenience, BPCA is better because its range of K values is well defined, which reduces the time required to identify the adoptable range of K values. For KNN, the suitable range of K values must be identified by trial and error. However, the performance of KNN is more consistent, as shown in Figs. 2, 3, 4, 5, 6, and 7. These findings suggest that the suitability of the KNN and BPCA imputation models may vary with the situation or settings.

6 Conclusion

In this study, the BPCA imputation model is found to be reliable, as it produces results similar to those of the KNN imputation model. The percentage of missing entries, the K value and the number of reference datasets are found to be the parameters that affect the imputation performance of KNN and BPCA. The results support the use of correlation and distance to select the rainfall data from neighbouring rainfall stations to be added to the input dataset. The imputation performance of both BPCA and KNN improves when reference data are added. The findings also suggest that the suitability of the BPCA and KNN imputation models depends on the situation. BPCA is found to be superior to KNN only at larger percentages of missing entries. It is recommended that the proposed method be applied in other study areas and with other data mining or machine learning based imputation models to determine whether it is a viable alternative for boosting their imputation performance.