Abstract
Daily rainfall data are one of the basic inputs to hydrological (e.g. streamflow, rainfall-runoff, recharge) and environmental (e.g. crop yield, drought risk) models, as well as to water quality assessment. In Malaysia, rain gauge stations with complete long-term records are very scarce. Missing values in rainfall data are mainly due to equipment malfunction and severe environmental conditions. Rainfall therefore needs to be estimated whenever data are missing at a principal rainfall station. In this study, daily rainfall data from eight meteorological stations located in Pahang state are considered, with Kuantan selected as the target station. The main purpose of this study is to compare the performance of three imputation methods: Artificial Neural Network (ANN), Bootstrapping and Expectation Maximization Algorithm, and Multivariate Imputation by Chained Equations (MICE). Missing rainfall data were generated randomly for Kuantan station at 5%, 10% and 15% missingness. The three methods are compared based on Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R2). ANN is found to be the best imputation method for this study, followed by MICE and then the Bootstrapping and Expectation Maximization Algorithm.
Keywords
- Daily rainfall data
- Artificial Neural Network
- Bootstrapping and Expectation Maximization Algorithm
- Multivariate Imputation by Chained Equations
- Imputation method
- Missing data
1 Introduction
Missing data is one of the common problems that arise in the process of data collection. Incomplete data occur for a variety of reasons, such as interruption of experiments, equipment failure, measurement limitations, attrition in longitudinal studies, censoring, use of new instruments, changes in record-keeping methods, loss of records, and non-response to questionnaire items [1, 2, 3, 5, 8, 11, 14, 15, 18, 21, 22, 23, 24, 25, 26, 27, 28].
There are various types of data-driven models, including the Artificial Neural Network (ANN), that can be used for weather forecasting. The ANN provides good approximations because the network is flexible, dynamic and works well with non-stationary data. ANN is a popular method for hydrological data analysis, as demonstrated by many researchers [6, 7, 9, 12, 13, 14, 15, 19]. The ANN has been shown to be one of the best methods for missing data prediction, on par with fuzzy rule-based approaches. In earlier work, estimates of missing rainfall data obtained using ANN were compared with results from regression and other simpler techniques such as the arithmetic and inverse distance methods. Based on previous studies, the ANN was chosen for comparison with other techniques because of its adequacy and reliability in predicting missing rainfall at particular gauge stations. Meanwhile, other methods require much more time to estimate the missing values, and the process involved is much more complicated than with ANN [3, 21].
Thus, the aim of this study is to estimate missing rainfall data using Artificial Neural Network (ANN), Bootstrapping and Expectation Maximization Algorithm, and Multivariate Imputation by Chained Equations (MICE). These methods were chosen because of their success and capability in handling missing value problems, as reported by previous researchers in the field [2, 4, 6, 7, 9, 12, 15, 18, 19]. In evaluating the methods, three levels of missing values, 5%, 10% and 15%, are considered. The performance of the methods is assessed using the Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R2). In addition, the results of this study can inform researchers working with hydrological data, in Malaysia and worldwide, about several alternative techniques and which of them gives the best performance. Research in this area is ongoing, so further findings on environmental issues may also emerge.
2 Materials and Methods
2.1 Data Description
In this study, daily rainfall data were obtained from the Malaysian Meteorological Department and the Drainage and Irrigation Department of Malaysia. The secondary data consist of daily rainfall amounts (mm) for the period from 1975 to 2017. Kuantan station, which is located in the state of Pahang, was chosen because of the heavy seasonal rainfall and winds that affected most parts of Peninsular Malaysia during December 2014. The rains caused severe flooding in the East Coast region, i.e. the states of Terengganu, Pahang and Kelantan [17]. Eight meteorological stations located in Pahang were selected: seven neighboring stations (Sekolah Menengah Ahmad Pekan, Felda Bukit Tajau, Felda Kampung New Zealand, Felda Sungai Pancing Selatan, Pusat Pertanian Tanaman Kampung Awah, Pusat Pertanian Bukit Goh and Mardi Sungai Baging) and Kuantan station, which was chosen as the target station (Table 1).
2.2 Missing Data Imputation Analysis by Using Artificial Neural Network
In the imputation process, the ANN was trained with three different learning algorithms, namely conjugate gradient with Fletcher–Reeves updates (CGF), Broyden–Fletcher–Goldfarb–Shanno (BFG) and Levenberg–Marquardt (LM). Three units of analysis, namely daily, 10-day and monthly rainfall values, were used to evaluate the predictive ability of the ANN. The output of the network is the amount of precipitation at station X, px(t), and the inputs are the amounts at the neighboring stations over the same period. The model can be expressed as:
$$p_x(t) = f\big(p_1(t), p_2(t), p_3(t), p_4(t), p_5(t), p_6(t), p_7(t)\big)$$
where p1, p2, p3, p4, p5, p6 and p7 denote the amounts of rainfall at the seven neighboring stations (all stations except Kuantan) and f is the function learned by the network. The ANN was trained and simulated in R using the neuralnet package.
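As an illustration only, the mapping from the seven neighboring stations to the target station can be sketched as a small feed-forward network. The sketch is written in Python rather than with the R neuralnet package used in the study, and it uses plain stochastic gradient descent instead of the CGF, BFG or LM algorithms. The function names, the learning rate, the three hidden units and the synthetic inputs scaled to [0, 1] are all assumptions made for the example.

```python
import math
import random

def init_network(n_inputs=7, n_hidden=3, seed=0):
    """One hidden layer; output weights start at zero so the first prediction is 0."""
    rng = random.Random(seed)
    return {
        "w_hidden": [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
                     for _ in range(n_hidden)],
        "b_hidden": [0.0] * n_hidden,
        "w_out": [0.0] * n_hidden,
        "b_out": 0.0,
    }

def forward(net, x):
    """Return (hidden activations, output) for one input vector x."""
    hidden = []
    for weights, bias in zip(net["w_hidden"], net["b_hidden"]):
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        hidden.append(1.0 / (1.0 + math.exp(-z)))  # logistic activation
    output = sum(w * h for w, h in zip(net["w_out"], hidden)) + net["b_out"]
    return hidden, output

def mean_squared_error(net, rows, targets):
    return sum((forward(net, x)[1] - y) ** 2
               for x, y in zip(rows, targets)) / len(rows)

def train(net, rows, targets, learning_rate=0.05, epochs=100):
    """Plain stochastic gradient descent on the squared error."""
    for _ in range(epochs):
        for x, y in zip(rows, targets):
            hidden, out = forward(net, x)
            err = out - y
            for j, h in enumerate(hidden):
                # Gradient w.r.t. the hidden pre-activation (uses the old w_out).
                grad_h = err * net["w_out"][j] * h * (1.0 - h)
                net["w_out"][j] -= learning_rate * err * h
                for i, xi in enumerate(x):
                    net["w_hidden"][j][i] -= learning_rate * grad_h * xi
                net["b_hidden"][j] -= learning_rate * grad_h
            net["b_out"] -= learning_rate * err
    return net

# Synthetic stand-in for scaled rainfall at seven neighbors (NOT the study data).
data_rng = random.Random(1)
rows = [[data_rng.random() for _ in range(7)] for _ in range(200)]
targets = [sum(r) / 7 for r in rows]  # hypothetical target: mean of the neighbors

net = init_network()
loss_before = mean_squared_error(net, rows, targets)
train(net, rows, targets)
loss_after = mean_squared_error(net, rows, targets)
```

In an imputation setting, the network would be trained on the days where the target station is observed and then applied to the neighbor values on the days where it is missing.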
2.3 Missing Data Imputation Method by Using Bootstrapping and Expectation Maximization Algorithm
One advantage of the Bootstrapping and Expectation Maximization Algorithm in the Amelia package of R is that it combines the speed and ease of use of the algorithm with the power of multiple imputation.
The imputation model in the Amelia package assumes that the complete data (that is, both observed and unobserved) are multivariate normal. Let D denote the (n × k) dataset, with observed part Dobs and unobserved part Dmis; the assumption is then given by the following equation:
$$D \sim \mathcal{N}_k(\mu, \Sigma)$$
That is, D has a multivariate normal distribution with mean vector µ and covariance matrix Σ. The multivariate normal distribution is often only a crude approximation to the true distribution of the data. Nevertheless, there is evidence that this model works as well as other, more complicated models, even for mixed data. Previous researchers have reported that the multivariate normal model can provide valid estimates even when its assumptions are violated, which may be due to a large sample size and a small percentage of missing values in the dataset [3, 21]. Furthermore, transformations of many types of variables can often make the normality assumption more plausible.
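The idea of combining bootstrapping with the EM algorithm under a normal model can be sketched in a simplified two-variable setting: one fully observed neighbor series x and one target series y with missing entries. The Python sketch below is not the Amelia implementation; the function names, the restriction to two variables, and the synthetic data are assumptions made for illustration, and real daily rainfall is far from normal.

```python
import math
import random

def em_bivariate_normal(pairs, iterations=50):
    """EM estimates of a bivariate normal when y may be None and x is complete.

    Returns (mu_x, mu_y, s_xx, s_xy, s_yy).
    """
    n = len(pairs)
    xs = [x for x, _ in pairs]
    observed = [y for _, y in pairs if y is not None]
    mu_x = sum(xs) / n
    s_xx = sum((x - mu_x) ** 2 for x in xs) / n
    mu_y = sum(observed) / len(observed)
    s_yy = sum((y - mu_y) ** 2 for y in observed) / len(observed)
    s_xy = 0.0
    for _ in range(iterations):
        beta = s_xy / s_xx
        resid_var = s_yy - beta * s_xy
        sum_y = sum_yy = sum_xy = 0.0
        for x, y in pairs:
            if y is None:
                # E-step: expected sufficient statistics given x.
                cond_mean = mu_y + beta * (x - mu_x)
                sum_y += cond_mean
                sum_yy += cond_mean ** 2 + resid_var
                sum_xy += x * cond_mean
            else:
                sum_y += y
                sum_yy += y * y
                sum_xy += x * y
        # M-step: update the parameters that involve y.
        mu_y = sum_y / n
        s_yy = sum_yy / n - mu_y ** 2
        s_xy = sum_xy / n - mu_x * mu_y
    return mu_x, mu_y, s_xx, s_xy, s_yy

def bootstrap_em_impute(pairs, m=5, seed=0):
    """Return m completed copies of pairs, one per bootstrap sample."""
    rng = random.Random(seed)
    n = len(pairs)
    completed = []
    for _ in range(m):
        boot = [pairs[rng.randrange(n)] for _ in range(n)]
        mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate_normal(boot)
        beta = s_xy / s_xx
        sd = math.sqrt(max(s_yy - beta * s_xy, 0.0))
        completed.append([
            (x, y if y is not None else rng.gauss(mu_y + beta * (x - mu_x), sd))
            for x, y in pairs
        ])
    return completed

# Synthetic example (NOT the study data): y = 2x + noise, about 20% of y removed.
data_rng = random.Random(42)
data = []
for _ in range(300):
    x = data_rng.gauss(0.0, 1.0)
    y = 2.0 * x + data_rng.gauss(0.0, 0.5)
    data.append((x, None if data_rng.random() < 0.2 else y))

mu_x, mu_y, s_xx, s_xy, s_yy = em_bivariate_normal(data)
slope = s_xy / s_xx
imputed_sets = bootstrap_em_impute(data, m=5)
```

Each bootstrap sample yields its own parameter estimates, so the m completed data sets differ from one another, which is how the uncertainty about the missing values is carried forward.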
2.4 Missing Data Imputation Analysis by Using Multivariate Imputation by Chained Equation
Multivariate Imputation by Chained Equations (MICE) is a practical approach for generating missing rainfall values based on a set of imputation models. MICE is also known as fully conditional specification and as sequential regression multivariate imputation. The three stages of MICE are described below:
i. Generating Multiple Imputed Data Sets
ii. Analyzing Multiple Imputed Data Sets
iii. Combining Estimates from Multiply Imputed Data Sets
The m estimates are combined into an overall estimate and variance–covariance matrix using Rubin’s rules, which are based on asymptotic theory in a Bayesian framework.
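For a single scalar quantity, the combining stage reduces to a few lines. The Python sketch below pools m point estimates and their variances with Rubin's rules: the pooled estimate is the mean of the estimates, and the total variance is the within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name and the example numbers are hypothetical, and the scalar case is a simplification of the variance–covariance matrix case mentioned in the text.

```python
def pool_rubin(estimates, variances):
    """Pool m scalar estimates and their variances using Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates, each with a variance")
    pooled = sum(estimates) / m
    within = sum(variances) / m  # average within-imputation variance
    between = sum((q - pooled) ** 2 for q in estimates) / (m - 1)
    total = within + (1.0 + 1.0 / m) * between
    return pooled, total

# Hypothetical estimates from m = 3 imputed data sets (illustrative only).
pooled_estimate, total_variance = pool_rubin([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
```

Here the pooled estimate is 2.0, the within-imputation variance is 0.5 and the between-imputation variance is 1.0, so the total variance is 0.5 + (4/3)(1.0).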
2.5 Performance Measure Criteria
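Three performance indicators, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R2), are used to assess the imputation methods. The MAE is
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert f_i - y_i\rvert$$
where fi is the prediction, yi is the true value and n is the number of values compared. The RMSE is
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(X_{obs,i} - X_{model,i}\right)^2}$$
where Xobs,i is the observed value and Xmodel,i is the modeled value at time or location i.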
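The three indicators can be computed as in the following Python sketch. The MAE and RMSE follow the definitions above; for R2 the text gives no formula, so the sketch assumes the common form 1 − SSres/SStot, which may differ from the definition used in the study (for example, a squared correlation). The function names and sample values are hypothetical.

```python
import math

def mean_absolute_error(predicted, actual):
    """Average absolute difference between predictions and true values."""
    return sum(abs(f - y) for f, y in zip(predicted, actual)) / len(actual)

def root_mean_square_error(predicted, actual):
    """Square root of the average squared difference."""
    return math.sqrt(
        sum((f - y) ** 2 for f, y in zip(predicted, actual)) / len(actual)
    )

def r_squared(predicted, actual):
    """Assumed form: 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((y - f) ** 2 for f, y in zip(predicted, actual))
    ss_tot = sum((y - mean_actual) ** 2 for y in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical imputed and actual rainfall amounts in mm (illustrative only).
predicted = [2.0, 4.0, 6.0]
actual = [1.0, 4.0, 7.0]
mae_value = mean_absolute_error(predicted, actual)
rmse_value = root_mean_square_error(predicted, actual)
r2_value = r_squared(predicted, actual)
```

Lower MAE and RMSE and higher R2 indicate imputed values that are closer to the actual ones.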
3 Results and Discussion
Referring to Table 2, the lowest mean rainfall amount, 5.3 mm, is recorded at Pusat Pertanian Tanaman Kampung Awah, while the highest mean rainfall amount, 8 mm, is observed at Felda Sungai Pancing Selatan. For the period from 1975 to 2017, Kuantan station has a complete record with no missing values. Meanwhile, Felda Bukit Tajau station has 9.4% missing values, the highest proportion among the neighboring stations.
The fitted Artificial Neural Network is visualized in Fig. 1. The model has three hidden units. The black lines show the connections, together with the weights that are used to impute the missing values at the target station, and the blue lines show the bias terms estimated by the network. The figure shows the result for the 5% generated missing values at Kuantan station (Fig. 1).
Figure 2 compares the actual data with the values imputed using Artificial Neural Network (ANN) for 5% missing values at Kuantan station. In Figs. 2 to 4, the blue and red dots denote the actual and predicted values, respectively.
Figure 3 compares the actual data with the values imputed using the Expectation Maximization Algorithm for 5% missing values at Kuantan station.
Figure 4 compares the actual data with the values imputed using Multivariate Imputation by Chained Equations for 5% missing values at Kuantan station.
The best imputation method is the one with the lowest MAE and RMSE and the highest R2. Based on the results shown in Table 3, ANN (neuralnet) consistently gives the lowest MAE and RMSE and the highest R2 at all three levels of missing values, followed by MICE and finally by the Bootstrapping and Expectation Maximization Algorithm (Amelia). Thus, based on the data used in this study, ANN is found to be the best imputation method for generating missing rainfall data at Kuantan station.
4 Conclusion
The aim of this research is to compare three methods for imputing missing values at Kuantan station, which was chosen because its data set is complete for the period from 1975 to 2017. To evaluate the three imputation methods, missing data were created at three levels: 5%, 10% and 15%. The missing data at Kuantan station were generated under the Missing at Random (MAR) assumption. The imputed values from each method were compared with the actual data at this station.
The performance of each method was evaluated using three performance measures: MAE, RMSE and R2. According to some previous studies, the most popular method for imputing rainfall data is not necessarily the most efficient. In this study, the results show that ANN, using the neuralnet package in R, was the best method for imputing the missing rainfall data compared with the other two methods. Meanwhile, the Bootstrapping and Expectation Maximization Algorithm in the Amelia package and Multivariate Imputation by Chained Equations (MICE) in the mice package of R have rarely been used for estimating missing rainfall data, although both methods and packages are widely used for imputing missing data in other fields.
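The evaluation design, in which a known fraction of a complete series is hidden so that imputed values can be compared with the truth, can be sketched as follows. The Python function below hides values uniformly at random, which is an assumption: the text does not detail the mechanism used to generate the missing data. The function name, seed and synthetic series are hypothetical.

```python
import random

def mask_at_random(values, fraction, seed=0):
    """Hide a given fraction of values; return (masked copy, hidden indices)."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)

# Synthetic complete series standing in for the target station (NOT real data).
complete_series = [float(i % 9) for i in range(200)]

masked_by_level = {}
for level in (0.05, 0.10, 0.15):
    masked_by_level[level] = mask_at_random(complete_series, level, seed=123)
```

The returned indices allow the imputed values at the hidden positions to be compared with the held-back actual values when computing the performance measures.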
References
Abuelgasim, A.A., Gopal, S., Strahler, A.H.: Forward and inverse modelling of canopy directional reflectance using a neural network. Int. J. Remote Sens. 19(3), 453–471 (1998)
Amer, S.R.: Neural network imputation: a new fashion or a good tool. Unpublished Ph.D. thesis (2004)
Demirtas, H., Freels, S.A., Yucel, R.M.: Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. JSCS 78(1), 69–84 (2008)
Fausett, L.: Fundamentals of Neural Networks: Architectures, Algorithms, and Applications (No. 006.3). Prentice-Hall (1994)
Kamaruzaman, I.F., Zin, W.Z.W., Ariff, N.M.: A comparison of method for treating missing daily rainfall data in Peninsular Malaysia. Malays. J. Fundam. Appl. Sci. (4–1), 375–380 (2017)
Khan, I.Y., Zope, P.H., Suralkar, S.R.: Importance of artificial neural network in medical diagnosis disease like acute nephritis disease and heart disease. Int. J. Eng. Sci. Innov. Technol. (IJESIT) 2(2), 210–217 (2013)
Karunanithi, N., Grenney, W.J., Whitley, D., Bovee, K.: Neural networks for river flow prediction. J. Comput. Civ. Eng. 8(2), 201–220 (1994)
Shaadan, N., Deni, S.M., Jemain, A.A.: Application of functional data analysis for the treatment of missing air quality data. Sains Malays. 44(10), 1531–1540 (2015)
Kuligowski, R.J., Barros, A.P.: Using artificial neural networks to estimate missing rainfall data. J. Am. Water Resour. Assoc. 34(6), 1437–1447 (1998)
Le Barbé, L., Lebel, T., Tapsoba, D.: Rainfall variability in West Africa during the years 1950–1990. J. Clim. 15(2), 187–202 (2002)
Leung, H., Haykin, S.: Detection and estimation using an adaptive rational function filter. IEEE Trans. Signal Process. 42(12), 3366–3376 (1994)
Livingstone, D.J., Manallack, D.T., Tetko, I.V.: Data modeling with neural networks–an answer to the maiden’s prayer? J. Comp. Aid. Mol. Des. 11, 135–142 (1996)
Nasr, M., Zahran, H.F.: Using of pH as a tool to predict salinity of groundwater for irrigation purpose using artificial neural network. Egypt. J. Aquat. Res. 40(2), 111–115 (2014)
Rahman, N.A., Deni, S.M., Ramli, N.M.: Generalized linear model for estimation of missing daily rainfall data. AIP Conf. Proc. 1830(1), 080019 (2017)
Paulhus, J.L., Kohler, M.A.: Interpolation of missing precipitation records. Mon. Weather Rev. 80(8), 129–133 (1952)
Ratnayake, U., Herath, S.: Changing rainfall and its impact on landslides in Sri Lanka. J. Mt. Sci. 2(3), 218–224 (2005)
ReliefWeb (2014). Malaysia: Seasonal Floods 2014 - Information Bulletin n° 1. https://reliefweb.int/report/malaysia/malaysia-seasonal-floods-2014-aaainformation-bulletin-n-1. Accessed 4 Nov 2017
Royston, P., White, I.R.: Multiple imputation by chained equations (MICE): implementation in Stata. J. Stat. Softw. 45(4), 1–20 (2011)
Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)
Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
Schaming, D., et al.: Easy methods for the electropolymerization of porphyrins based on the oxidation of the macrocycles. Electrochim. Acta 56(28), 10454–10463 (2011)
Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London (1997)
Burhanuddin, S.N.Z.A., Deni, S.M., Ramli, N.M.: Normal ratio in multiple imputation based on Bootstrapped sample for rainfall data with missingness. Int. J. Geomate 13(36), 131–137 (2017)
Suhaila, J., Jemain, A.A., Hamdan, M.F., Zin, W.Z.W.: Comparing rainfall patterns between regions in Peninsular Malaysia via a functional data analysis technique. J. Hydrol. 411(3), 197–206 (2011)
Suhaila, J., Sayang, M.D., Jemain, A.A.: Revised spatial weighting methods for estimation of missing rainfall data. Asia Pac. J. Atmos. Sci. 44(2), 93–104 (2008)
Tang, W.Y., Kassim, A.H.M., Abu Bakar, S.H.: Comparative studies of various missing data treatment methods-Malaysian experience. Atmos. Res. 42(1–4), 247–262 (1996)
Von Davier, M.: Imputing proficiency data under planned missingness in population models. In: Handbook of International Large-Scale Assessment. Background, Technical Issues, and Methods of Data Analysis, pp. 175–202 (2014)
Young, K.C.: A three-way model for interpolating for monthly precipitation values. Mon. Weather Rev. 120(11), 2561–2569 (1992)
Acknowledgement
The authors wish to thank the Malaysian Meteorological Department for the data and the Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), for its sponsorship. The authors are also indebted to the staff of the Drainage and Irrigation Department for providing the daily rainfall data for this study. They also express their sincere appreciation to the reviewers for their valuable suggestions and remarks, which helped to improve the manuscript. This research would not have been completed without the sponsorship from the Ministry of Higher Education (600-RMI/TRGS DIS 5/3 (1/2015)).
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Norazizi, N.A.A., Deni, S.M. (2019). Comparison of Artificial Neural Network (ANN) and Other Imputation Methods in Estimating Missing Rainfall Data at Kuantan Station. In: Berry, M., Yap, B., Mohamed, A., Köppen, M. (eds) Soft Computing in Data Science. SCDS 2019. Communications in Computer and Information Science, vol 1100. Springer, Singapore. https://doi.org/10.1007/978-981-15-0399-3_24
DOI: https://doi.org/10.1007/978-981-15-0399-3_24
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0398-6
Online ISBN: 978-981-15-0399-3
eBook Packages: Computer Science, Computer Science (R0)