1 Introduction

One of the most important hydrological factors is rainfall, which initiates various hydrological processes within the system and consequently provides data for a wide range of analyses (Wangwongchai et al. 2023). According to the report prepared by the Centre for Research on the Epidemiology of Disasters (CRED), the most impactful natural disasters in 2022 were droughts and floods, both driven by rainfall (CRED 2023). Effective management of these disasters requires optimal water resources planning, which relies on high-quality rainfall data covering significant periods. Data gaps can occur for various reasons, including erroneous manual data entry, equipment errors during data collection, or data lost to defective storage technologies (Gao et al. 2018). Despite such gaps, many hydrological analyses rely on statistical approaches that require complete time series, such as the SPI method and the flow duration curve.

Understanding and addressing the issue of missing data is critically important for ensuring the validity and reliability of research findings. This study is motivated by the need to mitigate the adverse effects that missing data can have on statistical analyses, particularly in hydrological research, where precise data are crucial. The presence of missing data in any series has several implications: (I) a decrease in the power and accuracy of statistical research methods (Roth et al. 1999), (II) the potential for biased estimates of relationships between two or more variables (Pigott 2001), (III) reduced representativeness of samples, and (IV) added complexity in the analyses used in the study (Kang 2013). For these reasons, a gapless time series is a necessary prerequisite for the statistical and deterministic modelling approaches used in hydrology (Gao et al. 2018). To solve the missing data problem, researchers have focused on two main approaches: deletion and imputation. Before opting for deletion, however, it is crucial to examine whether the deficiency in the dataset is a structural defect. If the missing data stem from a structural issue, deleting them may introduce bias into the model, and a significant amount of information may be lost.

Inadequate accounting for missing data, especially in rainfall or flow time series, can lead to poor basin simulation and, as a result, ineffective management of water resources (Gao et al. 2023). Missing values must therefore be imputed with great care. Imputation methods fall into two main categories: value assignment (mean, mode, median, etc.) and estimation-based imputation. Predictive imputation methods include machine learning techniques (k-Nearest Neighbour, Artificial Neural Networks, Support Vector Machines, Random Forest, etc.), multiple imputation methods, and model-based assignment (maximum likelihood/EM) methods.

Although numerous methods exist for missing data imputation in the literature, some prominent ones are the following. The mean (Sanusi et al. 2017; Üresin 2021; Zhang and Thorburn 2022) is the simplest method and is commonly used to fill gaps in meteorology and climatology. Arithmetic mean imputation is applied either when the normal annual rainfall at the surrounding stations is within 10% of the normal annual rainfall at the target station (Egigu 2020), or by replacing missing values in a variable with the arithmetic mean of the observed values of the same variable (Gao et al. 2023). Another widely preferred method is regression analysis (Caldera et al. 2016; Mfwango et al. 2018). The regression procedure has two stages: first, a regression model is developed using all complete observations, and the missing data are then imputed based on that model. One of the most commonly used machine learning techniques for missing data imputation is the k-nearest neighbour (kNN) algorithm (Sallaby and Azlan 2021; Sharma and Yuden 2021), in which the missing observation is estimated from the values of samples (neighbours) that are similar in one or more features. The most preferred model-based assignment method is maximum likelihood-based expectation maximization (EM) (Firat et al. 2012; Malan et al. 2020). While filling in missing data, expectation maximization provides accuracy and consistency by measuring how close the obtained estimates are to the actual data, which increases the reliability of the analysis results.

In addition to the methods mentioned earlier, recent studies have explored various approaches, leveraging advancing technologies. Owusu et al. (2019) evaluated three satellite rainfall products, TMPA 3B42RT, TMPA 3B42, and CMORPH, against gauged rainfall data using the correlation coefficient (r), bias, and percent bias as evaluation metrics. They found that TMPA 3B42 performed best across daily, monthly, annual, and seasonal timescales, while CMORPH consistently overestimated rainfall at gauge locations. Chan Chiu et al. (2021) proposed a sine cosine function fitting neural network (SC-FITNET), integrating principal component analysis (PCA) and a sine cosine algorithm, which outperformed other methods in imputing missing rainfall data. Addi et al. (2022) explored statistical imputation techniques for filling missing daily rainfall data, identifying regression, probabilistic principal component analysis (PPCA), and missForest as effective methods, particularly for capturing dry and wet periods and moderate to extreme rainfall events. Nascimento et al. (2022) applied self-organizing maps (SOM) to simulate monthly inflows using satellite-estimated rainfall, while Pinthong et al. (2022) evaluated different techniques for the estimation of missing monthly rainfall data. Their investigation encompassed six machine learning algorithms—Multiple Linear Regression (MLR), M5 model tree (M5), Random Forest (RF), Support Vector Regression (SVR), Multilayer Perceptron (MLP), and Gaussian Processes (GP)—as well as four spatial interpolation methods—Arithmetic Average (AA), Inverse Distance Weighting (IDW), Co-Kriging with Constant (CCW), and Nearest Neighbor (NR). The findings indicated that the machine learning approaches exhibited superior performance compared to the spatial interpolation methods, attributed to their capability to account for spatial constraints. Among the machine learning algorithms tested, GP demonstrated the highest efficacy in accurately estimating missing rainfall data, underscoring its potential utility in hydrological applications where spatial variability plays a critical role. Sahoo and Ghose (2022) compared the feed-forward artificial neural network (FNN), RF, kNN, and SOM for completing missing values in rainfall data; their findings highlighted the superior performance of the FNN across error metrics, demonstrating its effectiveness in managing data gaps in complex hydrological systems. Nida et al. (2023) evaluated imputation techniques across weather variables, favoring kNN for rainfall and mean imputation for temperature data. Khampuengson and Wang (2023) introduced full subsequence matching (FSM) as a novel approach for imputing missing values in telemetry water level data, aiming to address issues of incomplete or anomalous data caused by instrument failures. Their study compared FSM against established methods such as interpolation, kNN, MissForest, and long short-term memory (LSTM), demonstrating FSM's superior accuracy in imputing missing values, particularly for data exhibiting strong periodic patterns. Wangwongchai et al. (2023) investigated statistical techniques (STs) such as AA, MLR, and nonlinear iterative partial least squares (NIPALS), alongside artificial intelligence-based techniques (AITs) including the long short-term memory recurrent neural network (LSTM-RNN), the M5 model tree, and multilayer perceptron neural networks (MLPNN), for imputing missing daily rainfall data. Their findings highlighted that the M5 model tree (M5-MT) among the AITs and MLR among the STs were particularly effective, with MLR recommended for its accurate performance and straightforward application without requiring extensive prior modeling knowledge. Dariane et al. (2024) investigated various classical and machine learning methods for recovering missing streamflow data. Methods such as linear regression (LR&MLR), artificial neural networks (ANN), SVR, the M5 tree, and Adaptive Neuro-Fuzzy Inference Systems (ANFIS) using subtractive (Sub-ANFIS) and fuzzy C-means (FCM-ANFIS) clustering were compared, with the machine learning approaches generally demonstrating superior performance. In the study conducted by Kannegowda et al. (2024), Kalman smoothing with structured time series is recommended for small, medium, and large gaps in rainfall data, and Kalman–ARIMA is suggested for very large and mixed gaps; among multivariate methods, superior performance across varying gap lengths is consistently demonstrated by RF. Kaur et al. (2024) employed multivariate imputation by chained equations and nearest neighbors techniques to handle missing weather data crucial for avalanche forecasting. Their study assessed six key weather variables, demonstrating improved forecasting accuracy and skill scores for artificial neural network-based models following data imputation. Loh et al. (2024) compared kNN, SVR, MR, and ANN techniques for imputing missing fine sediment data, finding that ANN consistently outperformed the other methods across different missing data proportions. Apart from these studies, other research efforts in the literature can provide insights for future work on the practical application of hydrological modeling, structural engineering, and theoretical methods: Tama et al. (2023) predicted rainfall-induced runoff using a W-flow model, and Kencanawati et al. (2023) employed the rational method to determine peak discharge derived from surface runoff.

In the field of hydrology, the conventional approach for filling in missing values generally involves direct regression analyses. However, developments in machine learning techniques, particularly in recent decades, have introduced alternative methods such as ANN, the kNN algorithm, and ANFIS. When determining the most suitable method among various alternatives, researchers often create a simulated dataset by intentionally deleting some of the data with known values and then estimating these missing values. In most hydrology studies, however, this process considers only the intentional deletion and neglects the incomplete data structure when forming the simulated dataset. Notably, the actual amount of missing data is often disregarded. Tabachnick and Fidell (2012) argue that incomplete data mechanisms and patterns have a more significant impact on research results than the incomplete data rate.

This study aims to make a significant contribution to the literature by addressing the missing data problem, which is common in hydrology and related fields, with explicit attention to the missing data structure and missing data techniques. Unlike previous studies that often focus on theoretical frameworks or limited case studies, the current approach rigorously applies and compares estimation techniques on simulated datasets built from real historical data, including traditional methods such as the mean, median, and interpolation, an innovative machine learning approach (k-nearest neighbour), and finally EM, a model-based imputation technique. To the best of the authors' knowledge, this study is one of the few in hydrology in which the missing data pattern, missing data count, and missing data mechanisms, which are critical evaluation criteria for missing data issues, are examined simultaneously and simulated data are created on this basis. In this way, it aims to present a methodology for the missing data problem in hydrology. Additionally, the effects of the normality assumption and station selection on Expectation Maximization performance are investigated. Furthermore, the study extends beyond imputation accuracy to include comprehensive homogeneity analyses, employing tools such as Mathematica to assess the temporal and spatial consistency of the completed datasets.

2 Study Area Description and Data Utilized

The Susurluk Basin, located in western Turkey between 39°–40° north latitude and 27°–30° east longitude, covers approximately 3.11% of Turkey's total surface area, spanning about 24,349 km² with a drainage area of 22,399 km². The basin is characterized by its diverse topography, featuring Uludağ, the highest mountain in the Marmara Region, within its bounds. Extending in an east–west direction, this mountain system significantly influences the region. The basin encompasses parts of the Balıkesir, Bursa, and Kütahya provinces. Noteworthy water bodies within the Susurluk Basin include Simav Stream, Nilüfer Stream, Mustafa Kemal Paşa Stream, and Koca Stream. The Susurluk Basin experiences a transitional climate, exhibiting characteristics of both the Mediterranean and Black Sea climates (SBFMP 2018). With an annual mean rainfall of 688.54 mm and a mean annual flow of 5.43 km³/year, the basin plays a crucial role in Turkey's water resources. The region's significance is further underscored by the presence of two main freshwater lakes, Manyas Lake (24,400 hectares) and Uluabat Lake (19,900 hectares), both covered by the Ramsar Convention (Mucan 2022). Given its strategic location and recent developments, including the construction of new water resource structures such as dams, the Susurluk Basin holds considerable economic and social importance for Turkey.

Continuity in the observation series cannot always be ensured, for reasons such as changes in station locations, the opening and closing of observation stations, equipment errors, planned maintenance or updates during the data collection process, and staff shortages. Additionally, there are stations in the Susurluk Basin that were operational for a certain period but were later closed. Particularly since 2005, numerous stations have been established; however, the availability of stations with long-term records is limited. Notably, there are no stations with a sufficient recording history to adequately represent the western part of the Susurluk Basin. To address this limitation, some stations located outside the basin were incorporated into this study. Given that the provinces of Balıkesir, Bursa, and Kütahya encompass a significant portion of the basin, data from all observation stations in these provinces were obtained from the Turkish State Meteorological Service. Subsequently, stations were selected from this extensive group according to criteria termed adaptation parameters in this study. The selection process aimed to include stations that align maximally with the basin. The adaptation parameters considered are temporality, locationality, and similarity.

In the evaluation concerning temporality, a key consideration was ensuring that the selected stations had records from the same starting date until the present day. The determination of the study period's commencement was influenced by the climate reference periods, which are consecutive 30-year intervals calculated from climate data (Demircan et al. 2013, 2014). Climate modeling studies commonly use data from climate reference periods such as 1961–1990, 1971–2000, and 1981–2010 as climate norms. Consequently, stations with records spanning 1981–2021, which include the 1981–2010 reference period, were deemed suitable for inclusion, as this period was considered representative of the basin and its surroundings within the context of temporality.

In the analysis concerning the location criterion, the influential factors were the distance of the stations with records from 1981 to 2021 to the basin and whether part of the basin fell within a station's Thiessen polygon. Using ArcGIS software, Thiessen polygons were delineated for the stations and their impact weights were calculated. It was determined that the Thiessen polygons of the Gediz and Gönen stations covered part of the basin. Although the Thiessen polygon of the Edremit station does not cover the basin, the station was included in the study due to its proximity to the western region of the basin. The primary rationale for this selection is to reduce the spatial variation of hydrological and climatic parameters, thereby increasing the reliability and representativeness of the data.

In the assessment conducted for similarity, annual rainfall amounts in the basin were calculated over the study period. The relationship with stations outside the basin was examined using correlation coefficients, with care taken to ensure that stations located outside the basin were at least moderately compatible with the basin. Additionally, the missing value percentages of the stations within the basin played a role in determining which stations outside the basin should be included in the study. Consequently, 13 meteorological stations in the Susurluk Basin and its surroundings that met the criteria set under the adaptation parameters were selected for inclusion in the study, as depicted in Fig. 1. This methodology aims to ensure that the selected stations reflect the hydrological characteristics of the basin more accurately and to enhance the reliability of the study results.

Fig. 1 Study area

Table 1 presents detailed information regarding the locations of the meteorological stations. Table 2 presents general descriptive statistics of the monthly total rainfall data, calculated using SPSS software (2013). As seen from this table, the skewness coefficient values vary from 1.06 to 4.90. Notably, the Bursa station exhibits high skewness, indicating that a significant portion of the rainfall is concentrated around lower values, with fewer instances of high rainfall. Moreover, the Uludağ meteorology station registers a monthly total rainfall mean approximately twice that of the Bursa meteorology station. Over the study period, the annual total rainfall at Uludağ reaches 2258 mm, compared to 1290.4 mm at the Bursa meteorology station.

Table 1 Location information of the meteorological stations
Table 2 General descriptive statistics of monthly total rainfall of the meteorological stations

3 Methods

3.1 General Considerations of Missing Data

This section examines the percentage of missing data, missing data patterns, and the various missing data mechanisms. While these parameters are often overlooked in hydrological missing data imputation studies, they play a crucial role in data analysis and in determining appropriate strategies for handling missing data. Each parameter is discussed below.

3.1.1 The Percentage of Missing Data

The percentage of missing data is vital for assessing the representativeness and reliability of the dataset. A low percentage indicates a more reliable dataset with stronger analysis results, while a high percentage requires careful consideration when determining strategies for handling the missing data. The acceptable percentage of missing data depends on the research purpose, the sample size, and the mechanism of the missing data. While there is no exact threshold for an acceptable percentage of missing data, some studies have proposed distinct boundary values. For instance, Schafer (1999) suggested that a missing rate of 5% or below has minimal significance. Bennett (2001) proposed that missing data exceeding 10% is likely to introduce bias into statistical analysis. In certain statistical software, such as SPSS, 5% is used as a distinguishing point (Landau and Everitt 2004). When the proportion of missing data is below 5% and the missingness is either completely at random (MCAR) or at random (MAR), it may be feasible to exclude the missing data or use an appropriate single imputation method. Conversely, in the same scenario (MCAR or MAR missingness), if the proportion of missing data exceeds 5%, more sophisticated methodologies for imputing the missing values become necessary. In cases where the missing data are classified as missing not at random (MNAR) and the missingness is attributed to selection bias, corrective techniques such as the Heckman adjustment can be employed (Cheema 2014; Osman et al. 2018).
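As an illustration of this first check, a minimal pandas sketch for computing the missing-data percentage per station and flagging it against the 5% and 10% thresholds cited above is given below; the file name and column layout are assumptions, not part of the original study.

```python
import pandas as pd

# Illustrative only: `rainfall` is assumed to be a DataFrame of monthly totals,
# one column per station, with NaN marking missing records.
rainfall = pd.read_csv("monthly_rainfall.csv", index_col="date", parse_dates=True)

# Percentage of missing values per station
missing_pct = rainfall.isna().mean() * 100

# Flag stations against the commonly cited 5% and 10% thresholds
summary = pd.DataFrame({
    "missing_%": missing_pct.round(2),
    "below_5%": missing_pct <= 5,
    "above_10%": missing_pct > 10,
})
print(summary.sort_values("missing_%", ascending=False))
```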

3.1.2 Missing Data Patterns

The concept of missing data patterns involves identifying both missing and observable values within a dataset. It reveals the distribution of missing data and whether the gaps follow a specific pattern. For instance, understanding whether missing values are specific to a particular feature, category, or time period is crucial for comprehending the missing data pattern. While there is no standard list of missing data patterns, the three most common patterns are univariate, monotone, and non-monotone, as illustrated in Fig. 2.

Fig. 2 Missing data patterns (blue: observed values, red: missing values) (Emmanuel et al. 2021)

  • Univariate: There is a univariate missing data pattern when only one variable has missing data (Demirtas 2018; Emmanuel et al. 2021).

  • Monotone: This pattern occurs when the missing data follows a particular order. The presence of a monotone data pattern facilitates the handling of missing values, since the patterns among these missing values may be readily observed (Dong and Peng 2013).

  • Non-Monotone: This pattern does not follow any particular order, and the missingness occurs randomly or independently. Therefore, the missingness of one variable is not affected by the missingness of other variables (Chen 2022).

3.1.3 Missing Data Mechanisms

To better understand the missing data problem, the causes of missing data have been decomposed into various missing data mechanisms. Rubin (1976) appears to have been the first to formally introduce the mechanisms of missing completely at random and missing at random. Rubin identified three mechanisms for missing data: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR) (Dong and Peng 2013). The classification is based on the relationship of the missingness with the observed or unobserved values in the dataset. For detailed information about missing data mechanisms, the reader is referred to Rubin (1976).

Kalaycıoğlu (2017) symbolized the missing data mechanisms as follows in order to be more easily understood:

In any study, let Yi represent the dependent variable for each individual i. The dependent variable Y can then be divided into two parts, Yobserved and Ymissing, denoting the observed and missing values, respectively. Furthermore, let the p independent variables observed without missing values be defined by the matrix X = (X1, X2, …, Xp). Under these conditions, for each individual i, the missing data indicator matrix R for the dependent variable Yi can be defined as follows:

$$R_{i}=\begin{cases}1, & \text{if } Y_{i} \text{ is missing}\\ 0, & \text{if } Y_{i} \text{ is observed}\end{cases}$$
  • Missing completely at random (MCAR): The probability of the missing data is not associated with any observed or missing value of the dependent variable that contains missing data in the dataset.

$$f(R\mid Y^{\text{observed}},Y^{\text{missing}})=f(R)$$
(1)
  • Missing at random (MAR): The probability of missingness in a variable with missing data is related only to the observed values and is independent of the missing values of that variable.

$$f(R\mid Y^{\text{observed}},Y^{\text{missing}})=f(R\mid Y^{\text{observed}})$$
(2)

Under this assumption, the probability of the missing data on the dependent variable may also be related to observed or missing data on the independent variables. Namely,

$$f(R\mid Y^{\text{observed}},Y^{\text{missing}})=f(R\mid Y^{\text{observed}},X)$$
(3)
  • Missing not at random (MNAR): The probability of missingness in the dependent variable is related to the missing values \(Y^{missing}\) of the variable itself. Under this mechanism, an assumption about why the missing data occur must be included in the statistical analysis using composite models. However, including this assumption, which cannot be verified without prior knowledge of why the data are missing, requires more complex statistical models than the other methods. Because of this practical difficulty, statistical modeling in the presence of non-random missing data has not been widely used in the literature.

The mechanisms for missing data are defined by the probability of missing data occurrence. When this probability is entirely unrelated to other measured variables, the remaining sample is presumed to be a random subsample (missing completely at random, MCAR). However, if there is a relationship between other measured factors and the likelihood of missing data, it can be inferred that the data are not MCAR. Nevertheless, MNAR can never be definitively ruled out because, in practice, the missing values themselves are never known. Statistical tests in the literature can be employed to determine whether the missing data are entirely random. In this study, Little's (1988) MCAR test, one of the most preferred methods, was applied.
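To make the indicator matrix R and the distinction between MCAR and MAR more concrete, the following sketch simulates both mechanisms on synthetic data and builds R directly from the missingness mask; the synthetic data and all names are purely illustrative and are not taken from the study.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic monthly rainfall for two hypothetical stations (illustrative only)
n = 480
df = pd.DataFrame({
    "station_A": rng.gamma(shape=2.0, scale=30.0, size=n),
    "station_B": rng.gamma(shape=2.0, scale=35.0, size=n),
})

# MCAR: every record of station_A has the same 10% chance of being missing
mcar_mask = rng.random(n) < 0.10
df.loc[mcar_mask, "station_A"] = np.nan

# MAR: station_B is more likely to be missing when the *observed* station_A is high
mar_prob = np.where(df["station_A"].fillna(df["station_A"].median()) > 80, 0.25, 0.02)
df.loc[rng.random(n) < mar_prob, "station_B"] = np.nan

# Missing-data indicator matrix R (1 = missing, 0 = observed), as defined above
R = df.isna().astype(int)
print(R.mean())  # per-station missing rates
```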

3.2 Dealing with the Missing Data in Rainfall Data

Over time, numerous approaches have been developed to estimate missing values in a dataset. This section discusses the missing value approaches used in this study. These approaches can be broadly classified into four categories (Fig. 3).

Fig. 3 Taxonomy of the missing data techniques used in this study

3.2.1 An Overview of Simple Missing Data Handling Techniques

For decades, dozens of methods have been utilized to address the issue of missing data. This section describes the simple methods most commonly used in the literature. The simple imputation strategy involves substituting each missing value with a quantitative or qualitative feature derived from the available non-missing data (García-Laencina et al. 2009). Various approaches, such as the mode, mean, or median, are employed in simple imputation to fill missing data by utilizing the existing values. Simple imputation approaches are frequently employed in research due to their simplicity and their utility as a convenient reference strategy (Jerez et al. 2010). The arithmetic mean approach is used to estimate an incomplete rainfall record when the normal annual rainfall of the neighboring stations is within ±10% of the normal annual rainfall of the target station. If this condition is not met, the normal ratio method is used for the same purpose, or mean imputation can be performed using the values of the station with the missing value itself. In this study, as stated in Sect. 4.1, since many of the stations have missing values at the same time, each station's own values were used for mean imputation. Because the records of the neighboring stations could not be used, both the mean of the series over the study period and the mean of the two values before and after the gap were imputed in place of each missing value at the station. A similar procedure was applied by calculating the median of the nearby points. Spatial or temporal interpolation is another simple approach for filling missing data in a time series; in this study, temporal interpolation was applied using the observed values immediately before and after the missing data.
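A minimal pandas sketch of the column-based simple methods described above (series mean, mean/median of the two values before and after the gap, and temporal linear interpolation) might look as follows; the centered rolling window is an approximation of the "nearby points" procedure, and the function and parameter names are assumptions.

```python
import pandas as pd

def simple_imputations(series: pd.Series, k: int = 2) -> pd.DataFrame:
    """Return several simple gap-filling variants of one station's series.

    k is the number of values taken before and after each gap for the
    'nearby points' variants (2, as described in the text).
    """
    out = pd.DataFrame({"observed": series})

    # Series mean over the whole study period
    out["series_mean"] = series.fillna(series.mean())

    # Mean / median of the k values around each missing record
    rolling_mean = series.rolling(window=2 * k + 1, center=True, min_periods=1).mean()
    rolling_median = series.rolling(window=2 * k + 1, center=True, min_periods=1).median()
    out["mean_of_nearby"] = series.fillna(rolling_mean)
    out["median_of_nearby"] = series.fillna(rolling_median)

    # Temporal linear interpolation between the observations around the gap
    out["linear_interp"] = series.interpolate(method="linear", limit_direction="both")
    return out
```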

3.2.2 k-Nearest Neighbors (kNN) Imputation for Missing Values in Machine Learning

Several approaches are available for imputing missing data, one of the most frequently used being hot-deck imputation. A deterministic variation of these approaches is the "nearest neighbour" (NN) imputation algorithm (Andridge and Little 2010). Hot-deck imputation replaces the missing values of cases with missing data (recipients) with values obtained from cases (donors) that are similar to the recipient in terms of observable attributes (Beretta and Santaniello 2016). The major disadvantage of hot-deck imputation is the difficulty in defining the concept of 'similarity'; therefore, the hot-deck procedure does not provide a standard path for handling missing data. Nevertheless, it is an important technique, as it allows missing values to be recovered from a dataset without the need for additional mathematical or statistical information (Kalton and Kish 1984). Due to its relatively fast and simple algorithm, hot-deck imputation has become very popular among missing data imputation methods (Fadillah and Muchlisoh 2020).

The k-nearest neighbour (kNN) algorithm is a supervised learning algorithm used for classification, based on a distance function parameterized by k. Several distance measures, including the Minkowski, Manhattan, Cosine, Jaccard, and Hamming distances, can be used for kNN imputation; however, the Euclidean distance is reported to be the most efficient and productive (Amirteimoori and Kordrostami 2010; Emmanuel et al. 2021). In this study, the Python programming language with the Scikit-Learn (Scikit-Learn 2023) and Pandas (Pandas 2023) libraries was used for missing value imputation with the kNN algorithm. For detailed information about kNN, the reader is referred to Emmanuel et al. (2021).
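Since the study reports using Python with Scikit-Learn and Pandas for this step, a minimal sketch along those lines is shown below; the file name, the value of k, and the distance weighting are illustrative assumptions rather than the authors' exact configuration.

```python
import pandas as pd
from sklearn.impute import KNNImputer

# `rainfall` is assumed to be a DataFrame of monthly totals (rows = months,
# columns = stations) with NaN marking the gaps.
rainfall = pd.read_csv("monthly_rainfall.csv", index_col="date", parse_dates=True)

# KNNImputer uses a NaN-aware Euclidean distance by default; the number of
# neighbours and the distance weighting here are illustrative choices.
imputer = KNNImputer(n_neighbors=5, weights="distance")
completed = pd.DataFrame(
    imputer.fit_transform(rainfall),
    index=rainfall.index,
    columns=rainfall.columns,
)
```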

3.2.3 Expectation–Maximization (EM) Algorithm

The Expectation–Maximization (EM) algorithm is a commonly employed iterative technique for estimating the maximum likelihood parameters of statistical models (Dempster et al. 1977). Furthermore, it facilitates the estimation of parameters in probabilistic models that use incomplete data (Dikbas 2017). In this method, the missing values are first calculated using the estimated model parameters; these completed values are then used to recalculate the model parameters, and the process is repeated. When completing missing data, the EM algorithm does not take the cause of the gaps in the dataset into consideration and assumes that they are completely random. One of the most important advantages of the EM method is that the algorithm can be applied even when missing values occur simultaneously in multiple series, and no measured values are neglected. The Gaussian (normal) distribution of multivariate data can be represented by the mean vector and covariance matrix; that is, the mean and covariance matrix are the sufficient statistics of the normal distribution. The EM method uses an iterative algorithm to estimate the means, covariance matrix, and correlations of quantitative variables with missing values. It is an approach to the iterative calculation of maximum likelihood (ML) estimates in various missing data problems. Each iteration of the EM algorithm consists of two steps: the E-step (expectation step) and the M-step (maximization step). In the E-step, the missing data and the model parameters are estimated from the given observation values. In the M-step, the missing data are assumed to be known, and the parameters that maximize the expected likelihood function from the E-step are determined; these are then used in the next E-step to determine the distributions of the model parameters. Convergence is achieved as the likelihood increases with each iteration of the algorithm. Do and Batzoglou (2008) offer a comprehensive introduction to the mathematical underpinnings and practical applications of the EM approach. In this study, missing value analysis was conducted using the EM module in the toolbox of the IBM SPSS software (2013).
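The study performed EM imputation with the corresponding module of IBM SPSS; purely for illustration, a simplified NumPy sketch of EM imputation under a multivariate normal model is given below. It omits the conditional-covariance correction of the exact E-step, so it should be read as a conceptual outline rather than a reproduction of the SPSS implementation.

```python
import numpy as np

def em_impute(X, max_iter=100, tol=1e-6):
    """Simplified EM imputation under a multivariate normal model (sketch).

    X : 2-D array (rows = months, columns = stations) with np.nan for gaps.
    Note: the conditional-covariance correction of the exact E-step is
    omitted for brevity, so this is a conceptual illustration only.
    """
    X = np.asarray(X, dtype=float)
    n, _ = X.shape
    miss = np.isnan(X)

    # Initialisation: fill gaps with column means, then estimate mu and sigma
    mu = np.nanmean(X, axis=0)
    X_filled = np.where(miss, mu, X)
    sigma = np.cov(X_filled, rowvar=False)

    for _ in range(max_iter):
        mu_old = mu.copy()
        # E-step: replace each gap by its conditional expectation given the
        # observed values in the same row and the current (mu, sigma)
        for i in range(n):
            m, o = miss[i], ~miss[i]
            if not m.any():
                continue
            if not o.any():            # fully missing row: use the current mean
                X_filled[i, m] = mu[m]
                continue
            sigma_oo = sigma[np.ix_(o, o)] + 1e-9 * np.eye(o.sum())
            sigma_mo = sigma[np.ix_(m, o)]
            X_filled[i, m] = mu[m] + sigma_mo @ np.linalg.solve(
                sigma_oo, X_filled[i, o] - mu[o]
            )
        # M-step: re-estimate the mean vector and covariance matrix
        mu = X_filled.mean(axis=0)
        sigma = np.cov(X_filled, rowvar=False)
        if np.max(np.abs(mu - mu_old)) < tol:
            break
    return X_filled, mu, sigma
```

Each iteration alternates the conditional-expectation fill (E-step) with the re-estimation of the mean vector and covariance matrix (M-step), mirroring the description above.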

The effect of station selection and the normality assumption on imputation with Expectation Maximization is also one of the research topics of this study. The correlation matrix was taken into account for station selection. For the normality assumption, the values obtained from EM imputation applied to the raw data were compared with the values obtained from EM imputation applied to transformed versions of the data that were determined not to be normally distributed. In this context, three basic groups of methods can be used to test the assumption of normality: descriptive methods (examination of skewness, kurtosis, mean, mode, and median values), graphical methods (examination of histograms, stem-and-leaf plots, box-and-whisker plots, and P-P and Q-Q plots), and statistical methods (Shapiro–Wilk, Kolmogorov–Smirnov, Jarque–Bera, etc.). In the literature, skewness and kurtosis values falling between certain limits are accepted as an indicator that the data comply with the assumption of normal distribution. These limits are ±1 according to Hair et al. (2013), ±1.5 according to Tabachnick and Fidell (2012), and ±2 according to George and Mallery (2010).

Although skewness and kurtosis values provide researchers with a wider range for evaluating the assumption of normality, statistical tests yield more precise results. Many studies use and compare different tests to validate normality. In this study, the raw and transformed versions of the data were examined with three different approaches: skewness/kurtosis, the Shapiro–Wilk test (Shapiro and Wilk 1965), and the Jarque–Bera test (Jarque and Bera 1980). The Shapiro–Wilk and Jarque–Bera tests focus on different properties and evaluate different assumptions: the Shapiro–Wilk test is especially effective in small samples (Pituch and Stevens 2016), while the Jarque–Bera test provides a more comprehensive analysis by focusing on features such as skewness and kurtosis. Therefore, the final evaluation of the normality assumption was made with the Jarque–Bera test, although the Shapiro–Wilk results were also followed at every stage. The Shapiro–Wilk test was computed with the normality toolbox of the SPSS software, and the Jarque–Bera test with the tseries (tseries 2023) library in R. The flow chart of the study, prepared to clarify which analyses were carried out at which stage, is presented in Fig. 4.
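The Shapiro–Wilk test was run in SPSS and the Jarque–Bera test with the tseries package in R; for readers working in Python, an equivalent check can be sketched with SciPy as follows (illustrative only, not the software actually used in the study).

```python
from scipy import stats

def normality_summary(x, alpha=0.05):
    """Return the three normality indicators used in the study for one series."""
    sw_stat, sw_p = stats.shapiro(x)          # Shapiro-Wilk
    jb_stat, jb_p = stats.jarque_bera(x)      # Jarque-Bera (skewness/kurtosis based)
    return {
        "skewness": stats.skew(x),
        "excess_kurtosis": stats.kurtosis(x),  # 0 for a normal distribution
        "shapiro_p": sw_p,
        "jarque_bera_p": jb_p,
        "normal_by_JB": jb_p > alpha,          # final decision rule used in the study
    }
```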

Fig. 4 Flow chart of the study

3.3 Evaluation Criteria for Missing Data Imputation

To evaluate the accuracy of the predictions, the error criteria mean absolute error (MAE), root mean square error (RMSE), and mean bias error (MBE) were used. The root mean square error (RMSE) quantifies the discrepancy between observed and predicted values. The equations of these metrics are given below (Niazkar et al. 2023):

$$RMSE=\sqrt{\frac{{\sum }_{i=1}^{n}{\left({x}_{i}^{observed}-{x}_{i}^{predicted}\right)}^{2}}{n}}$$
(4)

MAE quantifies the mean magnitude of a prediction set’s errors as:

$$MAE=\frac{1}{n}{\sum }_{i=1}^{n}|{x}_{i}^{observed}-{x}_{i}^{predicted}|$$
(5)

MBE is used primarily to estimate the mean bias in the model and to decide whether any steps need to be taken to correct the model bias.

$$MBE=\frac{1}{n}{\sum }_{i=1}^{n}({x}_{i}^{predicted}-{x}_{i}^{observed})$$
(6)

where n denotes the number of data points, \({x}_{i}^{observed}\) represents the ith observed value, and \({x}_{i}^{predicted}\) represents the ith predicted value.
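Equations (4)–(6) translate directly into code; a short NumPy sketch with illustrative function names is given below.

```python
import numpy as np

def rmse(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.sqrt(np.mean((observed - predicted) ** 2))   # Eq. (4)

def mae(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.mean(np.abs(observed - predicted))            # Eq. (5)

def mbe(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.mean(predicted - observed)                    # Eq. (6): positive = over-estimation
```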

3.4 Homogeneity Test

The utilization of homogeneous series in climate change research holds critical importance. Changes in homogeneous series are attributed to variations in climate and weather patterns (Conrad and Pollak 1950). Several factors, such as the change in the position of the observation station, modifications in the observation format, and structural alterations in the station’s surrounding environment, can impact the quality and dependability of long-term climatological time series (Peterson et al. 1998). The presence of discontinuities in non-uniform time series, which are not attributable to environmental variables, introduces uncertainty in accurately determining changes in rainfall when such data are used for climate studies. Therefore, it is imperative to assess the homogeneity of observational data before incorporating it into any research endeavor. In the event that non-homogeneous data is identified, it should be either eliminated or adjusted to achieve homogeneity. Climate scientists have developed and employed numerous approaches to assess the homogeneity of the data under consideration (Klingbjer and Moberg 2003; Ducre-Rubiatille et al. 2003; Tomozeiu et al. 2005; Staudt et al. 2007; Modarres 2008).

In this study, homogeneity is determined by the two-step approach suggested by Wijngaard et al. (2003). In the first step, the homogeneity of all stations is checked with four tests: (I) the Standard Normal Homogeneity Test (SNHT) (Alexandersson 1986), (II) Pettitt's test (1979), (III) Buishand's test (1982), and (IV) Von Neumann's test (1941). The details of these tests are shown in Table 3; an illustrative sketch of one of them is given after the classification list below. The test statistics were computed using Mathematica software, and the results were evaluated within a 95% confidence interval.

Table 3 Formulas of the homogeneity tests (Hırca et al. 2022)

Homogeneity is checked by testing the null hypothesis (H0). The H0 hypothesis states that there is no change, implying that the data under investigation are homogeneous. In the second step, the stations are divided into three classes according to the homogeneity results:

  • Class 1: Homogeneous (one or zero tests reject the H0 at the 0.05 significance level)

  • Class 2: Doubtful (two tests reject the H0 at the 0.05 significance level)

  • Class 3: Suspect (three or four tests reject the H0 at the 0.05 significance level)
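As an example of the four tests listed in Table 3, a minimal Python sketch of Pettitt's (1979) test is given below; the approximate p-value formula is the standard one, and the implementation is illustrative rather than the Mathematica code used in the study.

```python
import numpy as np

def pettitt_test(x, alpha=0.05):
    """Pettitt's change-point test for homogeneity (illustrative sketch).

    Returns the change-point index (0-based), the test statistic K, the
    approximate two-sided p-value, and whether H0 (homogeneous) is rejected.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # U_k = sum_{i<=k} sum_{j>k} sign(x_i - x_j)
    sign_matrix = np.sign(x[:, None] - x[None, :])
    U = np.array([sign_matrix[: k + 1, k + 1 :].sum() for k in range(n - 1)])
    K = np.abs(U).max()
    change_point = int(np.abs(U).argmax())
    # Standard approximation of the two-sided p-value
    p_value = 2.0 * np.exp(-6.0 * K**2 / (n**3 + n**2))
    return change_point, K, p_value, p_value < alpha
```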

4 Results and Discussion

4.1 Creation of the Simulated Datasets

The location of the missing data plays a crucial role in dataset integrity. Model performance, particularly for methods such as median imputation of nearby points, mean of nearby points, or the series mean, is directly influenced by the distribution of the missing data within the series. A simulated dataset created directly from the holistic data matrix (449*13) can lead to inaccurate results, especially in months with seasonal transitions such as May and August; for example, September values would be drawn into the imputation of a missing August value, producing an above-normal rainfall estimate. To address this issue, simulated datasets were generated on a monthly scale, considering the number of missing values, the missing data pattern, and the missing data mechanism of the real datasets. The generation of the simulated datasets is based on deleting the rows of the real dataset that contain missing values at any station; in this complete dataset, a simulated dataset was then created by replicating the same missing value patterns observed in the real dataset. This process was repeated for each month with missing values. Table 4 presents the mean and standard deviation values for both the real and the simulated datasets. SPSS software was used and Little's MCAR test was applied to determine the missing data mechanisms. The results indicated that the missing value mechanism in the real datasets exhibited a missing completely at random (MCAR) structure, and it was confirmed that the simulated datasets shared the same MCAR structure. If the missing values in a dataset are MCAR, the probability of a value being missing is unrelated to both the observed and the unobserved data. This indicates the following:

Table 4 Comparison of real and simulated datasets in terms of missing values

  • Randomness: The missingness in the dataset is distributed randomly, without any discernible pattern or connection to the observed data or the missing values themselves.

  • No bias: The absence of data is not causally linked to any particular traits or values within the dataset.

The presence of missing values in an MCAR structure suggests that simple imputation techniques, such as mean or median imputation, can be appropriately utilized in the study. However, for data affected by missing at random (MAR) or missing not at random (MNAR) mechanisms, more sophisticated imputation methods such as multiple imputation or predictive modeling techniques may be necessary.
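To make the dataset-simulation procedure described at the beginning of this subsection concrete, the sketch below drops the rows containing gaps to obtain a complete reference set and then re-imposes the real missingness pattern on it; the file name and the row alignment of the pattern are assumptions, since the exact mapping is not detailed here.

```python
import pandas as pd

# `real_month` is assumed to hold one calendar month's records for all 13
# stations (rows = years, columns = stations), with NaN marking the real gaps.
real_month = pd.read_csv("rainfall_july.csv", index_col="year")

# Step 1: rows that are complete at every station form the reference set
complete = real_month.dropna(how="any")

# Step 2: re-impose the real missingness pattern on the reference set so that
# the simulated data share the same missing count, pattern, and (MCAR) mechanism.
# How pattern rows are aligned with reference rows is an illustrative choice here.
pattern = real_month.isna().iloc[: len(complete)].to_numpy()
simulated = complete.mask(pattern)
```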

Figure 5 shows the missing data patterns for July and September for both the real and simulated datasets. One of the most important parameters in missing data studies is the missing data pattern. The procedure generally applied to impute missing values in hydrology is based on selecting a key station for the station with missing values. In most cases, the missing value at the target station is completed using the key station's data. Therefore, in most imputation approaches (such as regression analysis, the normal ratio method, and some machine learning methods), estimation of missing rainfall data is possible only when data are available at the other stations. However, when missing values occur at all stations at the same time, methods that directly use the data of a key station cannot be preferred. Therefore, missing data patterns should be examined when determining the methods to be used in a study.

Fig. 5 Missing data pattern for (a) July and (b) September

In Fig. 5, each row represents a different pattern of missing values and indicates a group of samples with the same missing value pattern. These patterns, or groups of cases, are organized according to the specific variables in which the missing values occur. The stations on the x-axis are ranked according to the amount of missing values: the stations with the most missing values are on the far right of the graph and the stations with the fewest (or no) missing values are on the far left. The first pattern always contains no missing values. There are 11 different patterns for July and 9 different patterns for September. For example, while the Bursa station could be used to complete the Bigadiç station in July, it cannot be used in September because the Bursa station has a missing value in the same year. Therefore, the missing data pattern plays an important role in selecting the key station to be used for imputation.

4.2 Missing Value Imputation in Simulated Datasets

Missing rainfall values in the simulated datasets were estimated monthly using various imputation methods, including the series mean, mean of nearby points, median of nearby points, linear interpolation, hot-deck, kNN, and the EM algorithm. Because of simultaneous missing values at the stations, key-station-based methods could not be applied in every month. Consequently, column-based imputation techniques, which use the station's own records, were preferred in the study. However, EM imputation, which allows simultaneous completion, was also utilized. Different scenarios were created based on station selection and the normality assumption in EM imputation:

  • Scenario 1: EM imputation is applied to the raw data by matching each station with the station with which it is most compatible, as indicated by the single correlation matrix (Fig. 6), in the month with the missing value. In this scenario, the same station is used for matching in every month.

  • Scenario 2: EM imputation is applied to the raw data by matching each station with the most compatible station in the month to be completed, where the missing value is found (Table 5).

  • Scenario 3: According to the normality test results of the raw data, only the transformed versions of the series that are not normally distributed are completed, using the station with the highest correlation in the single correlation matrix (Table 6).

  • Scenario 4: According to the normality test results of the raw data, only the transformed versions of the series that are not normally distributed are completed, using the station with the highest correlation in the month in which the missing value is found (Table 7).

Fig. 6 Spearman's rho correlation analysis of simulated raw data in 449*13 matrix form

Table 5 Normality analysis of simulated raw data (for Scenario 2)
Table 6 Normality analysis of simulated raw data in 449*13 matrix form (for Scenario 1 and Scenario 3)
Table 7 Normality analysis of transformed versions of only non-normal distributed simulated data (for Scenario 3 and Scenario 4)

When the correlation is examined separately for each month, the station with the missing value is matched with the station having the highest correlation according to the monthly correlation analysis results. For the single correlation matrix, the simulated rainfall data are listed from January 1981 to December 2021 and the normality of the stations is first evaluated (Table 6); the correlation analysis is then performed according to the normality status of the stations. In Scenario 1 and Scenario 3, where a single correlation matrix is used, the stations with the highest correlations are matched in the same way in all months (Fig. 6).

In many studies, the assumption of normality is overlooked. However, depending on the result of the normality assessment, researchers choose between parametric and non-parametric methods, and failing to investigate normality can lead to erroneous inferences. Statistical tests can be preferred for normality testing because they provide clear results. While different limit values exist in the literature for cases where statistical tests are not used, Table 5 and Table 7 show that, in most cases where the skewness/kurtosis coefficients are between ±1, the Jarque–Bera test indicated that the rainfall series follow a normal distribution.

One or both of the continuous variables may not meet the normality assumption of the Pearson correlation. In such cases, Spearman's rho correlation is an alternative non-parametric method for determining whether a monotonic relationship exists between two variables. Therefore, in this study, the Jarque–Bera and Shapiro–Wilk tests were examined comparatively at all stages where correlation analysis was required. If the H0 hypothesis is rejected by the Jarque–Bera test, the rainfall series is accepted as not normally distributed. The hypotheses set up under the assumption of normality are:

H0: The data conform to a normal distribution.

H1: The data do not conform to a normal distribution.

Since the raw datasets are used in Scenario 1 and Scenario 2, the normality assumption is only used to decide which correlation analysis (Pearson or Spearman's rho) shows the relationship between stations. Scenario 2 is based on correlation analyses calculated separately for each month. Therefore, according to the normality analysis results in Table 5, the correlation coefficients are calculated for each month and the station most compatible with the station containing the missing value is determined.

The normality test results of the raw data prepared within the scope of Scenario 1 and Scenario 2 are given in Table 5. There are various approaches to investigating the normality of rainfall data: some studies consider skewness/kurtosis coefficients (Basu et al. 2004; Guo 2022), some the Shapiro–Wilk test (Mohammed and Scholz 2023), and others the Jarque–Bera test (Ünlükara et al. 2010; Ahani et al. 2012; Weslati et al. 2023). Both the Shapiro–Wilk and Jarque–Bera tests were performed to evaluate the assumption of normality. The two tests focus on different characteristics: the Shapiro–Wilk test is especially effective in small samples (Pituch and Stevens 2016), while the Jarque–Bera test provides a more comprehensive analysis by focusing on features such as skewness and kurtosis. Since the skewness and kurtosis properties of the datasets were to be evaluated in more detail, the final normality evaluation was carried out according to the Jarque–Bera test.

The procedural steps performed for the normality analysis results in Tables 5, 6 and 7 are explained below; a minimal code sketch of this workflow follows the list.

  1. Collect raw data: gather the data for analysis from the stations.

  2. Perform normality tests using appropriate statistical tests to check whether the raw data follow a normal distribution; common choices include the Jarque–Bera test, the Shapiro–Wilk test, and skewness/kurtosis coefficients.

  3. Based on the results of the normality tests, classify each dataset as either normally distributed or non-normally distributed.

  4. Transform the stations that are not normally distributed using methods such as square root, logarithmic, or cube root transformations.

  5. Select the correlation statistic based on the normality test results: Spearman's rho (for non-normally distributed data) or Pearson's correlation coefficient (for normally distributed data).
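Putting steps 2 and 5 together, the pairing logic might be sketched as follows; the significance level and function names are assumptions, and the final normality decision follows the Jarque–Bera rule used in the study.

```python
from scipy import stats

def correlation_for_pair(x, y, alpha=0.05):
    """Choose Pearson or Spearman's rho for two station series (illustrative).

    Follows the workflow above: the final normality decision is taken from the
    Jarque-Bera test, and Pearson is used only when both series pass it.
    """
    def is_normal(values):
        return stats.jarque_bera(values).pvalue > alpha

    if is_normal(x) and is_normal(y):
        r, p = stats.pearsonr(x, y)
        return "pearson", r, p
    r, p = stats.spearmanr(x, y)
    return "spearman", r, p
```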

Before each stage requiring correlation analysis, normality analyses were performed consistently following the steps above. Transformation methods were applied to the months and stations that did not show a normal distribution in Table 5, and these steps were applied for each station. To understand the effect of the transformations on normality, the normality tests were repeated and the results are given in Table 7. As seen from the table, the transformed versions of the non-normally distributed data show improvement according to the skewness/kurtosis, Shapiro–Wilk, and Jarque–Bera test results, indicating that the data either become normal or approach a normal distribution.

In light of Tables 5, 6 and 7, the following observations can be made:

  • Shapiro–Wilk is a very sensitive method for evaluating the normality assumption. Even when the skewness and kurtosis values are within ±1, there are datasets that are not normally distributed according to this test.

  • The fact that there were 9 months with missing values in the study and that the dataset was divided monthly led to 117 normality tests. Approximately 38% of the raw datasets were determined to be normally distributed, which confirms that rainfall data often have a skewed and irregular structure by nature.

  • Skewness and kurtosis coefficients give researchers broader ranges (e.g., ±1.5 or ±2) for assessing normality, so assuming normality based on these values is generally easier. However, skewness and kurtosis offer only an intuitive perspective on the normality of a dataset, and using formal normality tests leads to more reliable results. In this study, the Jarque–Bera test accepted the dataset as normally distributed in most cases where the skewness and kurtosis coefficients were within the ±1 range. Therefore, in studies where normality is assessed solely on the basis of skewness and kurtosis, using ±1 instead of broader thresholds is the more appropriate approach.

  • Among the three different normality approaches examined in this study, the tests can be ranked from the strongest to the weakest as follows: Shapiro–Wilk, Jarque–Bera, and skewness/kurtosis coefficients.

  • The transformation of non-normally distributed data shows improvement according to the skewness/kurtosis, Shapiro–Wilk, and Jarque–Bera test results, indicating that the data either normalize or approach normal distribution.

According to Table 8, which presents the evaluation with the error metrics, the RMSE and MAE results are close to each other. MBE indicates how much the predictions deviate from the actual values; an MBE close to zero indicates that the predictions are close to the observations. Since the error metrics gave similar results and the MBE of Scenario 2 was -0.19, this scenario was selected for implementation. While median imputation of nearby points was the worst-performing method, kNN was determined to be the most effective method after Expectation Maximization.

Table 8 Determining the most appropriate method

Different scenarios have been created to address the necessity of the normality assumption in the expectation maximization process. Based on the created scenarios, the results of the Expectation Maximization can be summarized as follows:

  • The use of the Expectation Maximization (EM) algorithm for imputing missing data offers advantages such as flexibility, a robust statistical foundation, an iterative nature, the ability to handle missing data directly, minimizing data loss, preserving data distribution, and widespread availability. These advantages make the EM algorithm an effective and reliable method for missing data analysis.

  • In this study, stations close to each other were not directly matched. For instance, Uludağ and Mustafa Kemal Paşa are stations that are close to each other (Fig. 1), but they differ physically in topographic, meteorological, and hydrological terms. Correlation analysis, being a statistical method, does not incorporate physical processes, so there is no issue even if the matched stations are far apart; it simply examines the relationship between two time series with the same unit (rainfall). Since most imputation methods are statistical analyses, matching stations based solely on their proximity is not a correct approach. In fact, in different months of this study, the correlation between two stations very close to each other was found to be very weak.

  • As a calculation approach, Expectation Maximization is not affected by the order of the data; for example, stations matched after logarithmic or square-root transformations are processed according to their own order. It is also a very useful method because it allows missing values at the key station itself to be estimated.

  • It was determined that EM imputations made after the transformation processes produced biased results.

The findings of the study, in line with Khalifeloo et al. (2015), suggest that expectation maximization (EM) should be preferred as it offers a fast and iterative approach to missing data imputation.

4.3 EM Imputation and Homogeneity Analysis of the Real Datasets in Scenario 2

After establishing that Scenario 2 was the most suitable method for the simulated rainfall datasets, normality analyses were initially applied to the real data. Following this, correlation analyses (Spearman’s rho or Pearson) were conducted for pairwise combinations based on the normality of the stations, and the most compatible station pairs were determined for each month. Finally, the complete rainfall series were obtained by applying EM.

If the rainfall series completed through missing data analyses are to be used in subsequent hydrological, meteorological, climate change, and forecasting studies, they must be hydrologically and statistically reliable. For this reason, the Standard Normal Homogeneity Test (SNHT), Pettitt, Buishand, and Von Neumann ratio homogeneity tests, which are frequently used in the literature, were applied to detect inhomogeneities in the annual total rainfall series. The test statistics were calculated in Mathematica software (2017) and evaluated within a 95% confidence interval. The findings obtained from the homogeneity analyses are given in Table 9. The results highlight that the Pettitt test is more sensitive in detecting inhomogeneity in the series.

Table 9 Homogeneity analyses*

According to the homogeneity analyses in Table 9, the majority of the stations are homogeneous. This finding provides a solid foundation for making accurate predictions and reliably evaluating long-term trends in studies such as climate change research or hydrological modeling.

5 Conclusion

Missing data estimation is important for the sustainable management of water resources, as missing data can make it difficult to determine appropriate policies and strategies. The main purpose of this research is to present a methodology for missing data estimation in hydrology. In this context, simulated datasets were created by considering the number of missing data, the missing data pattern, and the missing data mechanism of the real datasets containing missing values, aspects that are often overlooked in hydrology. This paper provides a comparison of simple imputation approaches, a machine learning technique, and a model-based imputation method. For this purpose, a missing data imputation study was carried out for the period 1981–2021, and the application of the proposed methodology is presented for the monthly total rainfall of the Susurluk Basin. For EM, which is a model-based assignment method, the scenarios built on station selection and the normality assumption allow the effect of these choices on the method's performance to be compared. EM was determined to be the most suitable assignment method, followed by the kNN method. The Jarque–Bera test generally works well for distributions with medium to long tails, and the test generally indicated that the rainfall series followed a normal distribution when the skewness and kurtosis coefficients were within the range of ±1. Correlation analyses between geographically close stations revealed that proximity alone does not guarantee a strong correlation in rainfall patterns, emphasizing the need for a comprehensive statistical approach rather than relying solely on geographical proximity for station matching. In future applied climatological studies, when reliable key stations with no missing data can be selected, it is recommended to evaluate hybrid methodologies that combine the benefits of various approaches, such as the statistical techniques (STs) and artificial intelligence-based techniques (AITs) discussed in the introduction, while adhering to the methodology presented in this study. These techniques would be even more advantageous if they also account for the critical factor emphasized in this study, namely the missing data pattern.