Comparison of Artificial Neural Network (ANN) and Other Imputation Methods in Estimating Missing Rainfall Data at Kuantan Station

Norazizi, Nur Afiqah Ahmad; Deni, Sayang Mohd

doi:10.1007/978-981-15-0399-3_24

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1100)

Included in the following conference series:

International Conference on Soft Computing in Data Science

Abstract

Daily rainfall data are a basic input to hydrological models (e.g. streamflow, rainfall-runoff, recharge) and environmental models (e.g. crop yield, drought risk), as well as to water quality assessment. In Malaysia, rain gauge stations with complete long-term records are scarce. Missing values in rainfall data are mainly caused by equipment malfunction and severe environmental conditions, so rainfall must be estimated whenever data are missing at a principal rainfall station. In this study, daily rainfall data from eight meteorological stations located in Pahang state are considered, and Kuantan is selected as the target station. The main purpose of this study is to compare the performance of three imputation methods: Artificial Neural Network (ANN), Bootstrapping and Expectation Maximization Algorithm, and Multivariate Imputation by Chained Equations (MICE). Missing rainfall data were generated randomly for Kuantan station at 5%, 10% and 15% missingness. The three methods are compared based on Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R²). ANN is found to be the best imputation method for this study, followed by MICE and then the Bootstrapping and Expectation Maximization Algorithm.

1 Introduction

Missing data is one of the common problems that arise during data collection. Incomplete data occur for a variety of reasons, such as interruption of experiments, equipment failure, measurement limitations, attrition in longitudinal studies, censoring, use of new instruments, changes in record-keeping methods, loss of records, and non-response to questionnaire items [1, 2, 3, 5, 8, 11, 14, 15, 18, 21, 22, 23, 24, 25, 26, 27, 28].

Various data-driven models, including the Artificial Neural Network (ANN), can be used for weather forecasting. The ANN provides a good approximation because of the capability of the network, its dynamic nature, and its ability to work well with non-stationary data. It is a popular method for hydrological data analysis, as demonstrated by many researchers [6, 7, 9, 12, 13, 14, 15, 19]. The ANN has been shown to be one of the best methods for missing data prediction, on par with the fuzzy rule-based approach. In previous studies, estimates of missing rainfall data obtained using ANN were compared with those obtained using regression and simpler techniques such as the arithmetic average and inverse distance methods. Based on these studies, the ANN was chosen for comparison with other techniques because it is adequate and reliable in predicting missing rainfall at particular gauge stations. Other methods, by contrast, take much longer to estimate the missing values and involve a more complicated process than ANN [3, 21].

Thus, the aim of this study is to estimate missing rainfall data using the Artificial Neural Network (ANN), the Bootstrapping and Expectation Maximization Algorithm, and Multivariate Imputation by Chained Equations (MICE). These methods were chosen because previous researchers have reported their success in handling missing value problems [2, 4, 6, 7, 9, 12, 15, 18, 19]. In evaluating the methods, three levels of missing values, 5%, 10% and 15%, are considered. Performance is assessed using the Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R²). The results can inform researchers in Malaysia and worldwide about alternative techniques for hydrological data and about which technique gives the best performance. Research in this area is ongoing, so new findings on related environmental issues may also emerge.
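The evaluation design, in which known values are hidden at 5%, 10% and 15% and later compared with their imputed counterparts, might be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study. The function name `mask_at_random`, the seed, and the short rainfall series are hypothetical, and positions are assumed to be drawn uniformly at random, which may be simpler than the mechanism the authors actually used.

```python
import random

def mask_at_random(values, fraction, seed=0):
    """Return (masked_copy, hidden_indices) with `fraction` of entries set to None."""
    rng = random.Random(seed)
    n_missing = round(fraction * len(values))
    # Assumption: positions are drawn uniformly at random without replacement.
    hidden = sorted(rng.sample(range(len(values)), n_missing))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden

# Hypothetical daily rainfall series (mm); not data from the study.
rain = [0.0, 12.5, 3.2, 0.0, 45.1, 0.0, 7.8, 0.0, 0.0, 21.4,
        0.0, 1.1, 0.0, 9.6, 0.0, 0.0, 30.2, 4.4, 0.0, 2.3]

# One masked copy per missingness level considered in the study.
masked_sets = {}
for level in (0.05, 0.10, 0.15):
    masked_sets[level] = mask_at_random(rain, level, seed=42)
```

Because the original series is left untouched, the hidden indices can later be used to compare each imputed value with the true value at the same position.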

2 Materials and Methods

2.1 Data Description

In this study, daily rainfall data were obtained from the Malaysian Meteorological Department and the Drainage and Irrigation Department of Malaysia. The secondary data consist of daily rainfall amounts (mm) for the period 1975 to 2017. Kuantan station, located in the state of Pahang, was chosen because of the heavy seasonal rainfall and winds that affected most parts of Peninsular Malaysia during December 2014. The rains caused severe flooding in the East Coast region, i.e. the states of Terengganu, Pahang and Kelantan [17]. Eight meteorological stations located in Pahang were selected: seven neighboring stations (Sekolah Menengah Ahmad Pekan, Felda Bukit Tajau, Felda Kampung New Zealand, Felda Sungai Pancing Selatan, Pusat Pertanian Tanaman Kampung Awah, Pusat Pertanian Bukit Goh and Mardi Sungai Baging) and Kuantan, which was chosen as the target station (Table 1).

Table 1. The geographical coordinates of the selected rainfall stations

2.2 Missing Data Imputation Analysis by Using Artificial Neural Network

To impute the missing values, the ANN was trained with three different learning algorithms, namely conjugate gradient with Fletcher–Reeves update (CGF), Broyden–Fletcher–Goldfarb–Shanno (BFG) and Levenberg–Marquardt (LM). Three units of analysis, daily, 10-day and monthly rainfall values, were used to evaluate the predictive ability of the ANN. The output of the network is the amount of precipitation at station X, p_x(t), and the inputs are the amounts at the neighboring stations over the same period. The model can be expressed as:

$$ p_{x} \left( t \right) = f\left[ {p_{1} \left( t \right), p_{2} \left( t \right), p_{3} \left( t \right), p_{4} \left( t \right),p_{5} \left( t \right),p_{6} \left( t \right),p_{7} \left( t \right)} \right] $$

(1)

where p_1, p_2, p_3, p_4, p_5, p_6 and p_7 denote the amounts of rainfall at the seven neighboring stations (all stations except Kuantan). The ANN was trained and simulated in R using the neuralnet package.
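The function f in Eq. (1) can be illustrated by a forward pass through a small network. The sketch below is in Python, not the R neuralnet code used in the study. It assumes a single hidden layer of three logistic units (the number of hidden units reported in Section 3) and a linear output. All weights, biases and input values are placeholders chosen for illustration, not fitted values.

```python
import math

def logistic(z):
    """Logistic (sigmoid) activation."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_target(neighbors, hidden_weights, hidden_biases, output_weights, output_bias):
    """Estimate p_x(t) from the seven neighboring amounts p_1(t)..p_7(t).

    Assumed architecture: one hidden layer of logistic units, linear output.
    """
    hidden = [
        logistic(bias + sum(w * p for w, p in zip(weights, neighbors)))
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    return output_bias + sum(v * h for v, h in zip(output_weights, hidden))

# Placeholder parameters for three hidden units (NOT fitted values from the study).
hidden_weights = [[0.01] * 7, [-0.02] * 7, [0.005] * 7]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [4.0, -2.0, 6.0]
output_bias = 1.0

# Hypothetical same-day rainfall (mm) at the seven neighboring stations.
neighbors_today = [0.0, 5.2, 12.0, 3.1, 0.0, 8.4, 1.5]
estimate = predict_target(neighbors_today, hidden_weights, hidden_biases,
                          output_weights, output_bias)
```

In practice the parameters would be obtained by training on days when all eight stations have records, and the trained network would then be applied to days when the target value is missing.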

2.3 Missing Data Imputation Method by Using Bootstrapping and Expectation Maximization Algorithm

One advantage of the Bootstrapping and Expectation Maximization Algorithm in the Amelia package for R is that it combines the speed and ease of use of the algorithm with the power of multiple imputation.

The imputation model in the Amelia package assumes that the complete data (that is, both observed and unobserved) are multivariate normal. Let D denote the (n × k) dataset, consisting of an observed part, D_obs, and an unobserved part, D_mis. The assumption is then:

$$ D \sim N_{k} \left( {\mu ,\Sigma } \right) $$

(2)

Here D follows a multivariate normal distribution with mean vector µ and covariance matrix Σ. The multivariate normal distribution often gives only a crude approximation to the true distribution of the data. Nevertheless, there is evidence that this model works as well as more complicated models designed for mixed data. Previous researchers have reported that the multivariate normal model can provide valid estimates even when its assumptions are violated, which may be due to a large sample size and a small percentage of missing values in the dataset [3, 21]. Furthermore, transforming many types of variables can often make the normality assumption more plausible.
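To see why the normality assumption makes imputation tractable, consider a single row of D split into its observed and unobserved parts. A standard property of the multivariate normal distribution gives the conditional distribution below. The partitioned symbols (µ_obs, µ_mis and the blocks of Σ) are notation introduced here for illustration, and the expression sketches the general idea behind normal-model imputation rather than the exact computation performed by Amelia:

$$ D_{mis} \mid D_{obs} \sim N\left( {\mu_{mis} + \Sigma_{mis,obs} \Sigma_{obs,obs}^{ - 1} \left( {D_{obs} - \mu_{obs} } \right),\;\Sigma_{mis,mis} - \Sigma_{mis,obs} \Sigma_{obs,obs}^{ - 1} \Sigma_{obs,mis} } \right) $$

Once µ and Σ have been estimated, a missing value at the target station can therefore be drawn from a normal distribution whose mean is a linear function of the rainfall observed at the neighboring stations.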

2.4 Missing Data Imputation Analysis by Using Multivariate Imputation by Chained Equations

Multivariate Imputation by Chained Equations (MICE) is a practical approach for generating missing rainfall values based on a set of imputation models. MICE is also known as fully conditional specification and as sequential regression multivariate imputation. The three stages of MICE are as follows:

i. Generating Multiple Imputed Data Sets
ii. Analyzing Multiple Imputed Data Sets
iii. Combining Estimates from Multiply Imputed Data Sets

The m estimates are combined into an overall estimate and variance–covariance matrix using Rubin’s rules, which are based on asymptotic theory in a Bayesian framework. The overall estimate is the average of the m estimates, where \hat{\theta}_j is the estimate from the j-th imputed data set:

$$ \bar{\theta } = \frac{1}{m}\sum\nolimits_{j = 1}^{m} {\hat{\theta }_{j} } $$

(3)
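The pooling step can be sketched for a scalar estimate as follows. The pooled mean corresponds to Eq. (3). The total variance uses the standard form of Rubin’s rules, within-imputation variance plus (1 + 1/m) times between-imputation variance, which is not written out in this paper. The function name `pool_rubin` and the input numbers are hypothetical, and the sketch is in Python rather than the R packages used in the study.

```python
from statistics import mean, variance

def pool_rubin(estimates, variances):
    """Pool m scalar estimates and their variances with Rubin's rules.

    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)           # Eq. (3): average of the m estimates
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # sample variance across imputations (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total

# Hypothetical estimates of mean rainfall (mm) from m = 3 imputed data sets.
pooled_estimate, total_variance = pool_rubin([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

The (1 + 1/m) factor inflates the between-imputation variance to account for using a finite number of imputed data sets.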

2.5 Performance Measure Criteria

Three performance indicators, Mean Absolute Error (MAE), Root Mean Square Error (RMSE) and Coefficient of Determination (R²), are used to assess the imputation methods.

$$ MAE = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {f_{i} - y_{i} } \right|} = \frac{1}{n}\sum\nolimits_{i = 1}^{n} {\left| {e_{i} } \right|} $$

(4)

where f_i is the prediction, y_i is the true value and e_i = f_i − y_i is the error.

$$ RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{n} {\left( {X_{obs,i} - X_{model,i} } \right)^{2} } }}{n}} $$

(5)

where X_obs,i is the observed value and X_model,i is the modeled value at time or location i.

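$$ R^{2} = 1 - \frac{RSS}{TSS} = 1 - \frac{{\sum\nolimits_{i = 1}^{n} {\left( {y_{i} - f_{i} } \right)^{2} } }}{{\sum\nolimits_{i = 1}^{n} {\left( {y_{i} - \bar{y}} \right)^{2} } }} $$

(6)

where RSS is the residual sum of squares, TSS is the total sum of squares and \bar{y} is the mean of the true values. For a least-squares fit with an intercept, this is equal to ESS/TSS, where ESS is the explained sum of squares.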
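The three measures can be computed directly from paired lists of true and imputed values. The following Python sketch follows Eqs. (4) and (5), and takes R² as 1 − RSS/TSS with TSS measured as squared deviations from the mean of the true values, which is an assumption about how Eq. (6) is meant to be applied. The function names and the sample numbers are hypothetical and are not results from the study.

```python
import math

def mae(actual, predicted):
    """Mean Absolute Error, Eq. (4)."""
    return sum(abs(f - y) for y, f in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root Mean Square Error, Eq. (5)."""
    return math.sqrt(sum((y - f) ** 2 for y, f in zip(actual, predicted)) / len(actual))

def r_squared(actual, predicted):
    """Coefficient of determination as 1 - RSS/TSS (TSS centered on the mean)."""
    mean_actual = sum(actual) / len(actual)
    rss = sum((y - f) ** 2 for y, f in zip(actual, predicted))
    tss = sum((y - mean_actual) ** 2 for y in actual)
    return 1 - rss / tss

# Hypothetical true and imputed rainfall amounts (mm) at the masked positions.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.0, 5.0, 6.0, 7.0]

scores = {
    "MAE": mae(actual, predicted),
    "RMSE": rmse(actual, predicted),
    "R2": r_squared(actual, predicted),
}
```

Only the positions that were deliberately hidden should enter these sums, since the remaining values are never imputed.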

3 Results and Discussion

Referring to Table 2, the lowest mean rainfall amount, 5.3 mm, is recorded at Pusat Pertanian Tanaman Kampung Awah, while the highest, 8 mm, is observed at Felda Sungai Pancing Selatan. For the period 1975 to 2017, Kuantan station has a complete record with no missing values. Felda Bukit Tajau station has 9.4% missing values, the highest proportion among the neighboring stations.

Table 2. Descriptive statistics for the target station and the neighboring stations

The fitted Artificial Neural Network is visualized in Fig. 1. The model has three hidden units. The black lines show the connections and their weights, which are used to impute the missing values at the target station, and the blue lines show the bias terms estimated by the network. The figure shows the results for the 5% generated missing values at Kuantan station.

Figure 2 compares the actual data with the predicted values imputed using the Artificial Neural Network (ANN) for 5% missing values at Kuantan station. In Figs. 2 to 4, the blue and red dots denote the actual and predicted values, respectively.

Figure 3 compares the actual data with the predicted values imputed using the Expectation Maximization Algorithm for 5% missing values at Kuantan station.

Figure 4 compares the actual data with the predicted values imputed using Multivariate Imputation by Chained Equations for 5% missing values at Kuantan station.

To determine the best imputation method, the method with the lowest MAE and RMSE and the highest R² is chosen. Based on the results shown in Table 3, the Artificial Neural Network (ANN) is the best imputation method, followed by Multivariate Imputation by Chained Equations (MICE) and then the Bootstrapping and Expectation Maximization Algorithm. Across all three levels of missing values, NEURALNET consistently gives the lowest MAE and RMSE and the highest R², followed by MICE and finally AMELIA. Thus, for the data used in this study, the ANN is the best imputation method for estimating missing rainfall data at Kuantan station.

Table 3. Performance measures for Mean Absolute Error, Root Mean Squared Error and Coefficient of Determination

4 Conclusion

The aim of this research is to compare three imputation methods for imputing missing values at Kuantan station, which was selected because its data set is complete for the period 1975 to 2017. To evaluate the three imputation methods, missing data were created at three levels: 5%, 10% and 15%. The missing data at Kuantan station were generated under the Missing at Random (MAR) assumption. The imputed values from each method were compared with the actual data at this station.

The performance of each method was evaluated using three performance measures: MAE, RMSE and R². According to some previous studies, the most popular method for imputing rainfall data is not necessarily the most efficient. The results show that the Artificial Neural Network (ANN), implemented with the neuralnet package in R, gave the best estimates of the missing rainfall data compared with the other two methods. The Bootstrapping and Expectation Maximization Algorithm in the Amelia package and Multivariate Imputation by Chained Equations (MICE) in the mice package of R have rarely been used for estimating missing rainfall data, although both methods and their packages are widely used for imputing missing data in other fields.

References

1. Abuelgasim, A.A., Gopal, S., Strahler, A.H.: Forward and inverse modelling of canopy directional reflectance using a neural network. Int. J. Remote Sens. 19(3), 453–471 (1998)
2. Amer, S.R.: Neural network imputation: a new fashion or a good tool. Unpublished Ph.D. thesis (2004)
3. Demirtas, H., Freels, S.A., Yucel, R.M.: Plausibility of multivariate normality assumption when multiply imputing non-Gaussian continuous outcomes: a simulation assessment. JSCS 78(1), 69–84 (2008)
4. Fausett, L.: Fundamentals of Neural Networks: Architectures, Algorithms, and Applications (No. 006.3). Prentice-Hall (1994)
5. Kamaruzaman, I.F., Zin, W.Z.W., Ariff, N.M.: A comparison of method for treating missing daily rainfall data in Peninsular Malaysia. Malays. J. Fundam. Appl. Sci. (4–1), 375–380 (2017)
6. Khan, I.Y., Zope, P.H., Suralkar, S.R.: Importance of artificial neural network in medical diagnosis disease like acute nephritis disease and heart disease. Int. J. Eng. Sci. Innov. Technol. (IJESIT) 2(2), 210–217 (2013)
7. Karunanithi, N., Grenney, W.J., Whitley, D., Bovee, K.: Neural networks for river flow prediction. J. Comput. Civ. Eng. 8(2), 201–220 (1994)
8. Shaadan, N., Deni, S.M., Jemain, A.A.: Application of functional data analysis for the treatment of missing air quality data. Sains Malays. 44(10), 1531–1540 (2015)
9. Kuligowski, R.J., Barros, A.P.: Using artificial neural networks to estimate missing rainfall data. J. Am. Water Resour. Assoc. 34(6), 1437–1447 (1998)
10. Le Barbé, L., Lebel, T., Tapsoba, D.: Rain fall variability in West Africa during the years 1950–1990. J. Clim. 15(2), 187–202 (2002)
11. Leung, H., Haykin, S.: Detection and estimation using an adaptive rational function filter. IEEE Trans. Signal Process. 42(12), 3366–3376 (1994)
12. Livingstone, D.J., Manallack, D.T., Tetko, I.V.: Data modeling with neural networks–an answer to the maiden’s prayer? J. Comp. Aid. Mol. Des. 11, 135–142 (1996)
13. Nasr, M., Zahran, H.F.: Using of pH as a tool to predict salinity of groundwater for irrigation purpose using artificial neural network. Egypt. J. Aquat. Res. 40(2), 111–115 (2014)
14. Rahman, N.A., Deni, S.M., Ramli, N.M.: Generalized linear model for estimation of missing daily rainfall data. AIP Conf. Proc. 1830(1), 080019 (2017)
15. Paulhus, J.L., Kohler, M.A.: Interpolation of missing precipitation records. Mon. Weather Rev. 80(8), 129–133 (1952)
16. Ratnayake, U., Herath, S.: Changing rainfall and its impact on landslides in Sri Lanka. J. Mt. Sci. 2(3), 218–224 (2005)
17. ReliefWeb: Malaysia: Seasonal Floods 2014 - Information Bulletin n° 1 (2014). https://reliefweb.int/report/malaysia/malaysia-seasonal-floods-2014-aaainformation-bulletin-n-1. Accessed 4 Nov 2017
18. Royston, P., White, I.R.: Multiple imputation by chained equations (MICE): implementation in Stata. J. Stat. Softw. 45(4), 1–20 (2011)
19. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533 (1986)
20. Rubin, D.B.: Inference and missing data. Biometrika 63(3), 581–592 (1976)
21. Schaming, D., et al.: Easy methods for the electropolymerization of porphyrins based on the oxidation of the macrocycles. Electrochim. Acta 56(28), 10454–10463 (2011)
22. Schafer, J.L.: Analysis of Incomplete Multivariate Data. Chapman & Hall/CRC, London (1997)
23. Burhanuddin, S.N.Z.A., Deni, S.M., Ramli, N.M.: Normal ratio in multiple imputation based on Bootstrapped sample for rainfall data with missingness. Int. J. Geomate 13(36), 131–137 (2017)
24. Suhaila, J., Jemain, A.A., Hamdan, M.F., Zin, W.Z.W.: Comparing rainfall patterns between regions in Peninsular Malaysia via a functional data analysis technique. J. Hydrol. 411(3), 197–206 (2011)
25. Suhaila, J., Sayang, M.D., Jemain, A.A.: Revised spatial weighting methods for estimation of missing rainfall data. Asia Pac. J. Atmos. Sci. 44(2), 93–104 (2008)
26. Tang, W.Y., Kassim, A.H.M., Abu Bakar, S.H.: Comparative studies of various missing data treatment methods-Malaysian experience. Atmos. Res. 42(1–4), 247–262 (1996)
27. Von Davier, M.: Imputing proficiency data under planned missingness in population models. In: Handbook of International Large-Scale Assessment. Background, Technical Issues, and Methods of Data Analysis, pp. 175–202 (2014)
28. Young, K.C.: A three-way model for interpolating for monthly precipitation values. Mon. Weather Rev. 120(11), 2561–2569 (1992)

Acknowledgement

The authors wish to thank the Malaysian Meteorological Department for the data and the Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA (UiTM), for sponsorship. The authors are also indebted to the staff of the Drainage and Irrigation Department for providing the daily rainfall data for this study. They also express their sincere appreciation to the reviewers for their valuable suggestions and remarks, which helped to improve the manuscript. This research would not have been completed without the sponsorship of the Ministry of Higher Education (600-RMI/TRGS DIS 5/3 (1/2015)).

Author information

Authors and Affiliations

Centre for Statistics and Decision Science Studies, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 40450, Shah Alam, Selangor, Malaysia
Nur Afiqah Ahmad Norazizi & Sayang Mohd Deni
Advanced Analytic Engineering Center (AAEC), Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 40450, Shah Alam, Selangor, Malaysia
Sayang Mohd Deni

Corresponding author

Correspondence to Sayang Mohd Deni.

Editor information

Editors and Affiliations

University of Tennessee, Knoxville, TN, USA
Michael W. Berry
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
Bee Wah Yap
Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
Azlinah Mohamed
Kyushu Institute of Technology, Fukuoka, Japan
Mario Köppen

Cite this paper

Norazizi, N.A.A., Deni, S.M. (2019). Comparison of Artificial Neural Network (ANN) and Other Imputation Methods in Estimating Missing Rainfall Data at Kuantan Station. In: Berry, M., Yap, B., Mohamed, A., Köppen, M. (eds) Soft Computing in Data Science. SCDS 2019. Communications in Computer and Information Science, vol 1100. Springer, Singapore. https://doi.org/10.1007/978-981-15-0399-3_24

DOI: https://doi.org/10.1007/978-981-15-0399-3_24
Published: 24 September 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-0398-6
Online ISBN: 978-981-15-0399-3
eBook Packages: Computer Science, Computer Science (R0)
