
1 Introduction

Missing data is one of the common problems that arise in the process of data collection. Incomplete data occur for a variety of reasons, such as interruption of experiments, equipment failure, measurement limitations, attrition in longitudinal studies, censoring, use of new instruments, changes in record-keeping methods, loss of records, and non-response to questionnaire items [1, 2, 3, 5, 8, 11, 14, 15, 18, 21, 22, 23, 24, 25, 26, 27, 28].

There are various types of data-driven models, including the Artificial Neural Network (ANN), which can be used for weather forecasting. The ANN provides good approximations because it is flexible, dynamic and works well with non-stationary data. ANN is a popular method for hydrological data analysis, as demonstrated by many researchers [6, 7, 9, 12, 13, 14, 15, 19]. The ANN has been shown to be one of the best methods for missing data prediction, on par with fuzzy rule-based approaches. Estimates of missing rainfall data obtained using ANN have been compared with results obtained using regression and simpler techniques such as the arithmetic mean and inverse distance methods. Based on previous studies, the ANN was chosen for comparison with other techniques because of its adequacy and reliability in predicting missing rainfall at particular gauge stations. Other methods, in contrast, require much more time to estimate the missing values and involve more complicated procedures than ANN [3, 21].

Thus, the aim of this study is to estimate missing rainfall data using Artificial Neural Network (ANN), Bootstrapping and Expectation Maximization Algorithm, and Multivariate Imputation by Chained Equations (MICE). These methods were chosen because of their success and capability in handling missing value problems, as reported by previous researchers in the field [2, 4, 6, 7, 9, 12, 15, 18, 19]. In evaluating these methods, three levels of missing values, 5%, 10% and 15%, are considered. The performance of the methods is assessed using the Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R2). In addition, the results of this study can inform researchers in Malaysia and worldwide about alternative imputation techniques for hydrological data and about which technique gives the best performance. Research and development on this topic is conducted on a continuous basis, so new findings on environmental issues may emerge as well.

2 Materials and Methods

2.1 Data Description

In this study, daily rainfall data were obtained from the Malaysian Meteorological Department and the Drainage and Irrigation Department of Malaysia. The secondary data consist of daily rainfall amounts (mm) for the period from 1975 to 2017. Kuantan station, located in the state of Pahang, was chosen because of the heavy seasonal rainfall and winds that affected most parts of Peninsular Malaysia during December 2014. The rains caused severe flooding in the East Coast region, i.e., the states of Terengganu, Pahang and Kelantan [17]. Eight meteorological stations located in Pahang were selected: Sekolah Menengah Ahmad Pekan, Felda Bukit Tajau, Felda Kampung New Zealand, Felda Sungai Pancing Selatan, Pusat Pertanian Tanaman Kampung Awah, Pusat Pertanian Bukit Goh and Mardi Sungai Baging as the neighboring stations, and Kuantan station as the target station (Table 1).

Table 1. The geographical coordinates of the selected rainfall stations

2.2 Missing Data Imputation Analysis by Using Artificial Neural Network

To impute the missing values, the ANN was trained with three different learning algorithms: conjugate gradient with Fletcher–Reeves updates (CGF), Broyden–Fletcher–Goldfarb–Shanno (BFG) and Levenberg–Marquardt (LM). Three units of analysis, namely daily, 10-day and monthly rainfall values, were used to evaluate the predictive ability of the ANN. The output of the network is the amount of precipitation at station X, px(t), and the inputs are the amounts recorded at the neighboring stations over the same period. The model can be expressed as:

$$ p_{x} \left( t \right) = f\left[ {p_{1} \left( t \right), p_{2} \left( t \right), p_{3} \left( t \right), p_{4} \left( t \right),p_{5} \left( t \right),p_{6} \left( t \right),p_{7} \left( t \right)} \right] $$
(1)

where p1, p2, p3, p4, p5, p6 and p7 denote the amounts of rainfall at the seven neighboring stations, that is, all selected stations except Kuantan. The ANN was trained and simulated in R using the neuralnet package.
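As a rough illustration of Eq. (1), the sketch below fits a network with one hidden layer of three units that maps rainfall at seven neighboring stations to rainfall at a target station. It is written in plain Python with ordinary stochastic gradient descent, not with the R neuralnet package or the CGF, BFG and LM algorithms used in the study. The data are synthetic, and the function names (`predict`, `train_network`) and all hyperparameters are assumptions made for this example.

```python
import math
import random


def predict(params, x):
    """Forward pass: tanh hidden layer followed by a linear output unit."""
    w_hidden, b_hidden, w_out, b_out = params
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out, hidden


def mean_squared_error(params, inputs, targets):
    errors = [predict(params, x)[0] - y for x, y in zip(inputs, targets)]
    return sum(e * e for e in errors) / len(errors)


def train_network(inputs, targets, n_hidden=3, epochs=200, lr=0.05, seed=0):
    """Fit the network by stochastic gradient descent on the squared error."""
    rng = random.Random(seed)
    n_in = len(inputs[0])
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                for _ in range(n_hidden)]
    b_hidden = [0.0] * n_hidden
    w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
    b_out = 0.0
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            y_hat, hidden = predict((w_hidden, b_hidden, w_out, b_out), x)
            err = y_hat - y
            for j in range(n_hidden):
                # Back-propagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
                grad_hidden = err * w_out[j] * (1.0 - hidden[j] ** 2)
                w_out[j] -= lr * err * hidden[j]
                for i in range(n_in):
                    w_hidden[j][i] -= lr * grad_hidden * x[i]
                b_hidden[j] -= lr * grad_hidden
            b_out -= lr * err
    return w_hidden, b_hidden, w_out, b_out


# Synthetic stand-in data: seven scaled neighbor readings per time step,
# with the target equal to their average plus a little noise.
rng = random.Random(1)
neighbors = [[rng.random() for _ in range(7)] for _ in range(200)]
target = [sum(row) / 7 + rng.gauss(0.0, 0.01) for row in neighbors]

initial = train_network(neighbors, target, epochs=0)  # untrained weights
fitted = train_network(neighbors, target)
mse_before = mean_squared_error(initial, neighbors, target)
mse_after = mean_squared_error(fitted, neighbors, target)
```

In an imputation setting, `predict(fitted, x)[0]` would be evaluated on the days where the target station is missing but the neighbors are observed. Inputs are scaled to [0, 1] here because unscaled rainfall totals would need a smaller learning rate.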

2.3 Missing Data Imputation Method by Using Bootstrapping and Expectation Maximization Algorithm

One advantage of the Bootstrapping and Expectation Maximization Algorithm in the Amelia package for R is that it combines the speed and ease of use of the algorithm with the power of multiple imputation.

The imputation model in the Amelia package assumes that the complete data (that is, both observed and unobserved) are multivariate normal. Let D denote the (n × k) data set, consisting of an observed part, Dobs, and an unobserved part, Dmis. The assumption is then:

$$ {\text{D}} \sim {\text{N}}_{\text{k}} \left( {\upmu,\Sigma } \right) $$
(2)

That is, D follows a multivariate normal distribution with mean vector µ and covariance matrix Σ. The multivariate normal distribution is often only a crude approximation to the true distribution of the data. Nevertheless, there is evidence that this model works as well as more complicated models, including those designed for mixed data. Previous researchers have reported that the multivariate normal model can provide valid estimates even when its assumptions are violated, which may be due to large sample sizes and small percentages of missing values in the data set [3, 21]. Furthermore, transformations of many types of variables can often make the normality assumption more plausible.
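Under Eq. (2), the distribution of the missing part of a record given its observed part has a standard closed form, which is one reason the multivariate normal model is convenient for imputation. As a reminder, partitioning µ and Σ to match Dobs and Dmis gives the following textbook property of the multivariate normal. It is not a description of Amelia's internal implementation, and the subscripted blocks are notation introduced here for illustration only:

$$ D_{mis} \mid D_{obs} \sim N\left( {\mu_{mis} + \Sigma_{mis,obs} \Sigma_{obs,obs}^{ - 1} \left( {D_{obs} - \mu_{obs} } \right),\;\Sigma_{mis,mis} - \Sigma_{mis,obs} \Sigma_{obs,obs}^{ - 1} \Sigma_{obs,mis} } \right) $$

Once estimates of µ and Σ are available, imputations can be drawn from this conditional distribution.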

2.4 Missing Data Imputation Analysis by Using Multivariate Imputation by Chained Equations

Multivariate Imputation by Chained Equations (MICE) is a practical approach to imputing missing rainfall values based on a set of imputation models. MICE is also known as fully conditional specification and as sequential regression multivariate imputation. The three stages of MICE are as follows:

  i. Generating Multiple Imputed Data Sets

  ii. Analyzing Multiple Imputed Data Sets

  iii. Combining Estimates from Multiply Imputed Data Sets

The m estimates are combined into an overall estimate and variance–covariance matrix using Rubin’s rules, which are based on asymptotic theory in a Bayesian framework:

$$ \bar{\theta } = \frac{1}{m}\sum\nolimits_{j = 1}^{m} {\hat{\theta }_{j} } $$
(3)

where the left-hand side is the pooled estimate and the summand is the estimate obtained from the j-th of the m imputed data sets.
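The pooling step can be sketched for a single scalar parameter as follows. The pooled estimate is the average in Eq. (3), and the total variance follows the usual form of Rubin's rules: the mean within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the input numbers are hypothetical and serve only as an illustration; they are not results from this study.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool m scalar estimates and their squared standard errors."""
    m = len(estimates)
    pooled = mean(estimates)        # Eq. (3): average of the m estimates
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # sample variance between imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical estimates from m = 3 imputed data sets (illustrative only).
pooled, total = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

The between-imputation term is what lets the pooled variance reflect the uncertainty due to the missing values, which a single imputation would ignore.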

2.5 Performance Measure Criteria

Three performance indicators, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R2), are used to assess the imputation methods.

$$ MAE = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {f_{i} - y_{i} } \right|} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {e_{i} } \right|} $$
(4)

where fi is the predicted value, yi is the true value, ei = fi − yi is the error and n is the number of values compared.

$$ RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {\left( {X_{obs,i} - X_{model,i} } \right)^{2} } }}{n}} $$
(5)

where Xobs,i is the observed value and Xmodel,i is the modeled value at time or location i.

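$$ R^{2} = 1 - \frac{RSS}{TSS} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {e_{i}^{2} } }}{{\sum\nolimits_{i = 1}^{n} {\left( {y_{i} - \bar{y}} \right)^{2} } }} $$
(6)

where ei = fi − yi as in Eq. (4), ȳ is the mean of the true values, RSS is the residual sum of squares and TSS is the total sum of squares. For a least-squares fit with an intercept, R2 also equals ESS/TSS, where ESS is the explained sum of squares.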
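The three measures in Eqs. (4) to (6) can be computed directly from paired lists of predicted and actual values, as in the minimal sketch below. R2 is computed here with the total sum of squares taken about the mean of the actual values. That is the usual convention, but it is an assumption about how Eq. (6) was applied in the study. The numbers are made up for illustration.

```python
import math


def mae(predicted, actual):
    """Mean Absolute Error, Eq. (4)."""
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root Mean Square Error, Eq. (5)."""
    squared = sum((y - f) ** 2 for f, y in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def r_squared(predicted, actual):
    """Coefficient of Determination, Eq. (6), as 1 - RSS / TSS."""
    y_bar = sum(actual) / len(actual)
    rss = sum((y - f) ** 2 for f, y in zip(predicted, actual))
    tss = sum((y - y_bar) ** 2 for y in actual)
    return 1.0 - rss / tss


# Illustrative values only, not data from the study.
actual = [0.0, 2.0, 4.0, 6.0]
predicted = [1.0, 2.0, 3.0, 6.0]
scores = (mae(predicted, actual), rmse(predicted, actual),
          r_squared(predicted, actual))
```

Lower MAE and RMSE and higher R2 indicate imputed values that are closer to the withheld observations.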

3 Results and Discussion

Referring to Table 2, the lowest mean rainfall amount, 5.3 mm, is recorded at Pusat Pertanian Tanaman Kampung Awah, while the highest, 8 mm, is observed at Felda Sungai Pancing Selatan. For the period from 1975 to 2017, Kuantan station has a complete record with no missing values. In contrast, 9.4% of the values at Felda Bukit Tajau station are missing, the highest proportion among the neighboring stations.

Table 2. Descriptive statistics for the target station and the neighboring stations

The fitted Artificial Neural Network is visualized in Fig. 1. The model has 3 hidden units. The black lines show the connections and their weights, which are used to impute the missing values at the target station. The blue lines show the bias terms estimated by the neural network. The figure shows the results for the 5% generated missing values at Kuantan station.

Fig. 1. Visualization by using Artificial Neural Network (Color figure online)

Figure 2 compares the actual data with the values imputed using Artificial Neural Network (ANN) for 5% missing values at Kuantan station. In Figs. 2 to 4, the blue and red dots denote the actual values and the predicted values, respectively.

Fig. 2. Predicted values by using Artificial Neural Network versus actual values in Kuantan station (5%) (Color figure online)

Figure 3 compares the actual data with the values imputed using the Bootstrapping and Expectation Maximization Algorithm for 5% missing values at Kuantan station.

Fig. 3. Predicted values by using Bootstrapping and Expectation Maximization Algorithm versus actual values in Kuantan station (5%) (Color figure online)

Figure 4 compares the actual data with the values imputed using Multivariate Imputation by Chained Equations for 5% missing values at Kuantan station.

Fig. 4. Predicted values by using Multivariate Imputations by Chained Equation versus actual values in Kuantan station (5%) (Color figure online)

To determine the best imputation method, the method with the lowest MAE and RMSE and the highest R2 is chosen. Based on the results shown in Table 3, the Artificial Neural Network (ANN) is the best imputation method, followed by Multivariate Imputation by Chained Equations (MICE) and then the Bootstrapping and Expectation Maximization Algorithm. NEURALNET consistently gives the lowest MAE and RMSE and the highest R2, followed by MICE and finally AMELIA, at all three levels of missing values. Thus, based on the data used in this study, ANN is the best method for imputing missing rainfall data at Kuantan station.

Table 3. Performance measures for Mean Absolute Error, Root Mean Squared Error and Coefficient of Determination

4 Conclusion

The aim of this research is to compare three methods for imputing missing values at Kuantan station, which was selected because its data set is complete for the period from 1975 to 2017. To evaluate the three imputation methods, missing data were created at three levels: 5%, 10% and 15%. The missing data at Kuantan station were generated under the Missing at Random (MAR) assumption. The imputed values from each method were compared with the actual data at this station.
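The step of creating artificial gaps in a complete series can be sketched as below. The sketch removes a fixed fraction of values uniformly at random, which strictly corresponds to missing completely at random. The study states a MAR assumption without describing the mechanism, so this is a simplification and not the authors' procedure. The function name `mask_at_random` and the synthetic series are hypothetical.

```python
import random


def mask_at_random(values, fraction, seed=0):
    """Return a copy of values with round(fraction * n) entries set to None."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in missing_idx else v for i, v in enumerate(values)]
    return masked, sorted(missing_idx)


# Synthetic stand-in for a complete daily series (not the Kuantan record).
rainfall = [float(i % 30) for i in range(1000)]
masked_sets = {level: mask_at_random(rainfall, level)
               for level in (0.05, 0.10, 0.15)}
```

Keeping the list of masked indices makes it straightforward to compare the imputed values with the withheld actual values afterwards.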

The performance of each method was evaluated using three performance measures: MAE, RMSE and R2. According to previous studies, the most popular method for imputing rainfall data is not necessarily the most efficient. In this study, the results show that the Artificial Neural Network (ANN), implemented with the neuralnet package in R, gave the best estimates of the missing rainfall data compared with the other two methods. Meanwhile, the Bootstrapping and Expectation Maximization Algorithm in the Amelia package and Multivariate Imputation by Chained Equations (MICE) in the mice package have rarely been used for estimating missing rainfall data, although both methods and their R packages are widely used for imputing missing data in other fields.