1 Introduction

Missing data make accurate data analysis and modeling very difficult. In fact, data loss is not only common in industry, commerce, and scientific research [25, 30] but also inevitable in those scenarios. Missing data generally arise from errors or failures of instruments or operators. Without careful treatment of missing data, domain experts cannot efficiently and precisely understand what their data really indicate [9].

In hydrology, agriculture, and other ecological fields, datasets are generally obtained by either human-operated devices or remotely automated devices [19, 26, 43, 44]. Nowadays, a popular methodology for large-scale, micro-level ecosystem monitoring is to deploy wireless sensor networks [4, 29] at the sites of interest. This brings substantial benefits: decreasing the costs of human resources and maintenance and enabling real-time observations across geographically distributed regions [5, 10, 31, 33, 35]. In practice, however, the use of wireless sensor networks in ecosystem monitoring poses new challenges: the dataset collected by wireless sensing systems, often called a wireless sensory dataset, often suffers more significant data losses. First, wireless ecology sensing systems are usually left unattended in outdoor environments, say, tropical forests, cold regions, wetlands, deserts, and riversides, and are expected to operate for a long term, say, a few weeks or even months. These systems are prone to accidental damage by extreme weather, such as storms, rains, or lightning, and therefore cannot always record ecological events constantly. Moreover, limited labor resources or unpredictably harsh weather sometimes make visits to remotely deployed devices infeasible; consequently, the damage or failure of a device often cannot be discovered until the next routine check, which aggravates the data loss so much that missing values, or even whole missing records, occur one after another, forming large gaps in the dataset.

Second, the low-power, low-rate wireless links that connect ecological sensors into a network are unreliable and sometimes dynamic [20] and consequently cannot deliver all the obtained data to end-users, which also leads to non-negligible data missing. Unlike traditional datasets, in which missing values are sparsely scattered, the wireless sensory dataset therefore inevitably suffers missing values that occur continuously over a large range and considerably undermine the completeness of the dataset. It is also worth noting that, strictly speaking, repeating the operations of obtaining ecological data does not make sense because of the ceaseless temporal–spatial dynamics of the natural environment. Figure 1 illustrates the incompleteness of the dataset obtained by a small-scale wireless sensor network we use to monitor the hydrology of forests. Clearly, we can see continuous data missing due to failed data communication through wireless links (at sensor 2) and unattended battery depletion (at sensor 3). In summary, handling data loss, i.e., determining desirable methods of infilling missing values, is still the first step for scientists in hydrology or agriculture to understand environmental evolution well, even though they benefit from wireless ecology monitoring systems in terms of human resources and real-time data acquisition.

Fig. 1
figure 1

Illustration of the continuous data missing in a dataset whose data points are returned by two wireless sensors

Until now, however, researchers have paid little attention to infilling the continuous missing values in wireless sensory time-series datasets and have little knowledge about which existing methods might be effective in such a case. This study investigates several typical approaches to infilling missing data designed for traditional time-series datasets, together with the Extreme Learning Machine (ELM), which has not previously been employed for missing-data infilling, and examines their performance in dealing with large-scale continuous data missing in a soil moisture dataset. The purpose is to analyze and determine which approaches are more promising for this new task and to offer insights for designing new missing-data infilling policies for wireless sensory ecological datasets. Our work is based on a soil moisture dataset because, as a critical environmental factor [21], soil moisture data are a common input to hydrologic and agricultural models in soil and water management activities [2, 3, 8, 12, 27].

The rest of this paper is organized as follows: Section 2 reviews related work on missing-value imputation methods. Section 3 presents eight methods of infilling missing data and how we apply them to our dataset. Section 4 introduces the soil dataset we use and evaluates the performance of these eight methods in terms of accuracy. Finally, Section 5 concludes this study.

2 Related work

Various methods have been employed to infill missing values in different scientific fields, including statistical methods, machine learning methods, and data mining methods. The authors in [13] apply a hybrid method, which combines the k-nearest neighbor algorithm and dynamic time warping, to infill missing values in gene expression time-series data. The work in [22] shows how genetic algorithms are used to develop locally weighted regression (LWR) models and time delay neural networks (TDNN) for estimating missing data, and compares these two sophisticated methods on short-term hourly traffic volume counts with missing values. The results show that LWR outperforms TDNN. In [32], a piecewise interpolation method based on cubic Ball and Bézier curve representations is presented to infill missing values of solar radiation.

Recently, there have been more studies of missing-value infilling methods for soil moisture datasets. The authors in [41] present a three-dimensional method, based on discrete transforms, for filling missing values in satellite-image datasets of soil moisture. Dumedah and Coulibaly [7] treat the soil moisture dataset as a time series and investigate the effectiveness of six methods, including multiple linear regression, the weighted Pearson correlation coefficient, the station relative difference, the soil layer relative difference, the monthly average, and a merged method. In their subsequent work [8], they further evaluate nine neural-network-based infilling methods; they find that the nonlinear autoregressive neural network, the rough set method, and the monthly replacement achieve better accuracy than the methods in their previous paper. Kornelson and Coulibaly [17] examine the effectiveness of the monthly average, the soil layer relative difference, linear and cubic interpolation, artificial neural networks, and evolutionary polynomial regression as infilling methods; the evaluation results show that the interpolation and artificial neural network methods are more effective, yet only for infilling small gaps in the dataset.

However, these methods all assume small gaps in the datasets and thus cannot be effectively applied to infill the continuous missing data inherent in wireless sensory datasets. In this paper, we evaluate the performance of the Extreme Learning Machine (ELM) and seven typical methods: Linear Interpolation (LI), the Soil Layer Relative Difference (SLRD), the Autoregressive Integrated Moving Average (ARIMA), Vertical Multiple Linear Regression (VMLR), Horizontal Multiple Linear Regression (HMLR), Weighted K-Nearest Neighbors (WKNN), and Radial Basis Function networks (RBF). We conduct comprehensive numerical experiments on a soil moisture dataset with various missing gaps and compare the imputation performance of these methods on records involving unsteady dynamics.

3 Description of infilling methods

This section introduces eight methods for infilling missing values in the soil moisture dataset: Linear Interpolation (LI), the Soil Layer Relative Difference (SLRD), the Autoregressive Integrated Moving Average (ARIMA), Vertical Multiple Linear Regression (VMLR), Horizontal Multiple Linear Regression (HMLR), Weighted K-Nearest Neighbors (WKNN), Radial Basis Function networks (RBF), and the Extreme Learning Machine (ELM). The reasons why we chose these eight methods are as follows. The LI, a simple but effective method, is commonly used in practice to infill missing values. The SLRD is commonly employed by field experts to infill hydrological data. The ARIMA model frequently appears in the reconstruction of time-series data with missing values. The VMLR and the HMLR are multiple linear regressions for infilling missing soil moisture values; the difference is that they leverage different sensing attributes in modeling: the first uses attributes from different layers of a given station, and the second uses attributes from different stations at the same layer. The WKNN is a typical machine learning method that is also used to predict missing values. The ELM and the RBF are both Single Layer Feed Forward Neural Networks (SLFNs). The RBF has recently been widely used to impute missing values and has achieved desirable results. The ELM, however, shows better generalization performance and has recently been applied in many fields, such as hydrology, pattern recognition, neuroscience, and consumer electronics; we want to know the potential of the ELM in the scenario of soil data infilling. We developed programs in the R language and MATLAB to implement these methods.

3.1 Linear interpolation (LI)

Based on curve fitting with linear polynomials, linear interpolation (LI) is a simple but effective method in practice [23]. The LI fills missing values of a time series by Eq. (1), where \(y_{0}\) and \(y_1\) are the soil moisture values at times \(t_{0}\) and \(t_1 (t_1 > t_0)\), respectively, and y is the missing value at time t, which ranges from \(t_{0}\) to \(t_{1}\).

$$\begin{aligned} y=y_{0}+(y_{1}-y_{0})\frac{t-t_{0}}{t_{1}-t_{0}} \end{aligned}$$
(1)
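As a minimal illustration (our actual implementations use R and MATLAB; the Python sketch and function names below are our own), Eq. (1) can be coded directly:

```python
def linear_interpolate(t0, y0, t1, y1, t):
    """Eq. (1): estimate the missing value at time t, with t0 <= t <= t1."""
    return y0 + (y1 - y0) * (t - t0) / (t1 - t0)

def fill_gap(t0, y0, t1, y1, missing_times):
    """Infill every missing sample between the two known endpoints."""
    return [linear_interpolate(t0, y0, t1, y1, t) for t in missing_times]
```

Because every infilled point lies on the straight line between \((t_0, y_0)\) and \((t_1, y_1)\), the LI cannot reproduce fluctuations inside large gaps.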

3.2 Soil layer relative difference (SLRD)

For infilling missing values of soil data, field experts often resort to the SLRD method [40], which employs the relative difference among soil moisture data. Equation (2) shows how the relative difference is computed. Suppose that there are n soil-monitoring stations in a given region, each of which reports a time series of soil records including samples returned by probes at different depths (layers). For a given sampling depth j, \(\theta _{i,j}(t)\) in Eq. (2) represents the soil moisture at depth j of station i at time t, and \(\bar{\theta }_j(t)\) represents the average over the depth-j soil moisture values reported by all n stations at time t. The relative difference \(\delta _{i,j}(t)\) is calculated by the first line of Eq. (2).

$$\begin{aligned} \delta _{i,j}(t)= & {} \frac{\theta _{i,j}(t) - \bar{\theta }_{j}(t)}{\bar{\theta }_{j}(t)} \nonumber \\ \bar{\theta }_{j}(t)= & {} \frac{1}{n}\,\sum _{i=1}^{n}\theta _{i,j}(t) \end{aligned}$$
(2)

Note that the SLRD method only takes into consideration the data at the same depth as the missing data, because it assumes that across different stations, the soil moisture data at an identical depth are strongly correlated [7]. When soil moisture is missing at depth j of station i at time t, \(\bar{\theta }_j(t)\) is computed from the available depth-j data of all the other stations, while \(\bar{\delta }_{i,j}\) is estimated by the mean of the relative differences of the j-th depth at station i. The estimated soil moisture \(\theta _{\mathrm{est}}\) is then given by Eq. (3).

$$\begin{aligned} \theta _{\mathrm{est}}(t) = \bar{\theta }_{j}(t) + \bar{\theta }_{j}(t) \times \bar{\delta }_{i,j} \end{aligned}$$
(3)
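The SLRD procedure can be sketched in Python as follows (an illustration under our reading of Eqs. (2)–(3); the data layout `station_series[s][j]`, a per-station list of depth-j readings over time with `None` marking missing values, is our own convention):

```python
def slrd_estimate(station_series, i, j, t):
    """Infill the depth-j moisture of station i at time t via Eqs. (2)-(3)."""
    # Cross-station mean at depth j and time t, excluding the missing station (Eq. 2, second line)
    others = [station_series[s][j][t] for s in range(len(station_series))
              if s != i and station_series[s][j][t] is not None]
    theta_bar = sum(others) / len(others)

    # Mean relative difference of station i at depth j over times with data (Eq. 2, first line)
    deltas = []
    for tt in range(len(station_series[i][j])):
        avail = [station_series[s][j][tt] for s in range(len(station_series))
                 if station_series[s][j][tt] is not None]
        if station_series[i][j][tt] is None or not avail:
            continue
        mean_tt = sum(avail) / len(avail)
        deltas.append((station_series[i][j][tt] - mean_tt) / mean_tt)
    delta_bar = sum(deltas) / len(deltas)

    return theta_bar + theta_bar * delta_bar  # Eq. (3)
```

For example, a station that consistently reads 10% above the cross-station mean has \(\bar{\delta }_{i,j} = 0.1\), so its missing value is estimated as the current cross-station mean inflated by 10%.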

3.3 Autoregressive integrated moving average (ARIMA)

As a classical statistical tool, the ARIMA model is widely used to analyze time-series data [16]. In fact, ARIMA subsumes three types of models: the autoregressive model (AR), the moving average model (MA), and the combined model (ARMA). To process a non-stationary data series, like the soil data we use, ARIMA differences the series to make it stationary with constant statistical properties. We do not consider the seasonal effect of our soil data because it was collected in winter; we therefore use the non-seasonal ARIMA(p, d, q) model [11] to predict (infill) the missing values, in which p is the number of autoregressive terms, d the number of non-seasonal differences needed for stationarity, and q the number of lagged forecast errors. The general ARIMA model is given by Eqs. (4) and (5).

$$\begin{aligned} y_{t} = {\left\{ \begin{array}{ll} Y_{t} &{} d = 0 \\ Y_{t} - Y_{t-1} &{} d = 1 \\ (Y_{t} - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) &{} d = 2 \\ \cdots &{} \cdots \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} \hat{y}_t = \mu + \phi _1y_{t-1} + \cdots + \phi _py_{t-p} - \theta _1e_{t-1} - \cdots - \theta _qe_{t-q} \end{aligned}$$
(5)

In Eq. (4), \(Y_t\) is the observed data series up to time t, and \(y_t\) is the d-th difference of \(Y_t\); generally, \(d\in [0,4]\) is adequate to yield a stationary series. In the general forecast given by Eq. (5), \(\phi _i (1\le i\le p)\) and \(\theta _i (1\le i\le q)\) are model parameters, while p and q are the model orders. The parameters \(\phi _i\) and \(\theta _i\) are often estimated by least squares or maximum likelihood methods.

When a missing value occupies position k in the whole series, ARIMA first chooses a sub-series of length \(L_k\) before the k-th data point. In this paper, we plot the original soil moisture data and find it non-stationary. After empirically differencing the non-stationary soil data of length \(L_k\) with a proper d, we determine a desirable pair of p and q by examining the auto-correlation and the partial auto-correlation of \(y_t\). Finally, we mainly use the arima function provided by the R language to complete the missing-value imputation.
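For illustration only, the fragment below sketches the workflow for the special case ARIMA(1, 1, 0): difference once, estimate the single AR coefficient by least squares, forecast the differences, and integrate them back. It is a toy stand-in for R's arima function, which additionally fits MA terms and uses maximum likelihood:

```python
def arima_110_forecast(series, steps):
    """Toy ARIMA(1,1,0): difference, fit AR(1) by least squares, forecast, integrate."""
    # d = 1: first difference (Eq. 4)
    y = [series[t] - series[t - 1] for t in range(1, len(series))]
    # Closed-form AR(1) least-squares estimate: phi = sum(y_t * y_{t-1}) / sum(y_{t-1}^2)
    num = sum(y[t] * y[t - 1] for t in range(1, len(y)))
    den = sum(y[t - 1] ** 2 for t in range(1, len(y)))
    phi = num / den if den else 0.0
    # Forecast the differences recursively, then undo the differencing (Eq. 5 with q = 0)
    forecasts, last, last_diff = [], series[-1], y[-1]
    for _ in range(steps):
        last_diff = phi * last_diff
        last += last_diff
        forecasts.append(last)
    return forecasts
```

On a series with a perfectly steady trend the fitted \(\phi \) is 1 and the forecast simply extends the trend, which is consistent with the straight-line behavior of ARIMA on large gaps reported in Section 4.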

3.4 Vertical multiple linear regression (VMLR)

Each sensing probe attached to a station samples not only the soil moisture but also the soil temperature and the electrical conductivity. The VMLR method assumes that for a given depth k, the soil moisture data at depth k correlate both with the soil moisture values at other depths and with the temperature and the electrical conductivity at depth k.

$$\begin{aligned} \hat{y}_k = a_1\times t_k + a_2\times c_k + \sum _{i=1,i\ne k}^{m}b_i\,y_i \end{aligned}$$
(6)

If there are m layers, the VMLR model is expressed by Eq. (6), where \(t_k\) and \(c_k\) represent the temperature and the electrical conductivity at depth k, respectively, and \(y_i\) is the soil moisture value at depth \(i (i\ne k)\). The task of the VMLR is therefore to find the parameters \(a_1\), \(a_2\), and \(b_i\).
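Since Eq. (6) is an ordinary least-squares problem without an intercept, the parameters can be fitted directly. The sketch below is illustrative (it assumes NumPy; the function and variable names are our own):

```python
import numpy as np

def fit_vmlr(t_k, c_k, other_moistures, y_k):
    """Fit Eq. (6) by least squares.
    t_k, c_k: length-N temperature and conductivity at depth k;
    other_moistures: N x (m-1) moisture matrix for the other depths;
    y_k: observed moisture at depth k.
    Returns the coefficient vector (a1, a2, b_1, ..., b_{m-1})."""
    X = np.column_stack([t_k, c_k, other_moistures])
    coef, *_ = np.linalg.lstsq(X, y_k, rcond=None)
    return coef

def predict_vmlr(coef, t_k, c_k, other_moistures):
    """Infill moisture at depth k for new records via Eq. (6)."""
    X = np.column_stack([np.atleast_1d(t_k), np.atleast_1d(c_k),
                         np.atleast_2d(other_moistures)])
    return X @ coef
```

The HMLR of Section 3.5 has exactly the same form, with the other-depth moisture columns replaced by same-depth readings from the other stations.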

3.5 Horizontal multiple linear regression (HMLR)

Similar to the VMLR, the HMLR method also uses multiple linear regression to infill missing soil moisture values. However, the HMLR focuses on the correlation of data points at the same depth across different stations; in other words, for a given station s, the soil moisture data at depth k of s correlate both with the soil moisture values at depth k of the other stations and with the temperature and the electrical conductivity at depth k of station s. The correlation of sensing attributes from nearby sensors is often exploited to predict data missing due to faulty devices [42].

$$\begin{aligned} \hat{y}_{s,k} = a_1\times t_{s,k} + a_2\times c_{s,k} + \sum _{i=1,i\ne s}^{m}b_i\cdot y_{i,k} \end{aligned}$$
(7)

The HMLR model is given by Eq. (7), where m denotes the number of stations, and \(t_{s,k}\) and \(c_{s,k}\) are the temperature and the electrical conductivity values at depth k of station s.

3.6 Weighted K-nearest neighbors (WKNN)

The WKNN resorts to the K most similar observations to impute missing values. In practice, the Euclidean distance is commonly used to measure the similarity between two data points. For simplicity, suppose that data point x has a missing value at attribute a, denoted by \(x^{(a)}\), and that there are n data points \(y_1\), \(y_2\), ..., \(y_n\) in the training dataset. The distance between x and \(y_i (1\le i\le n)\) can be calculated by Eq. (8), where m is the number of attributes of x or \(y_i\).

$$\begin{aligned} d(x,y_i)=\sqrt{{\sum _{j=1,j\ne a}^{m}\left( x^{(j)}-y_{i}^{(j)}\right) ^2}} \end{aligned}$$
(8)

After obtaining the distances from x to each \(y_i\), we can determine the k-nearest neighbors. For instance, if the k-nearest neighbors of x are as shown in Fig. 2 and the distance from x to \(y_i\) equals \(d_i\), we can infill \(x^{(a)}\) with \(\hat{x}^{(a)}\) calculated by Eqs. (9) and (10), which together express an implementation of a K-nearest neighbors model with a weighting function.

$$\begin{aligned} \hat{x}^{(a)}& {}= \sum _{i=1}^{k}\,y_i^{(a)}\,w(d_i) \end{aligned}$$
(9)
$$ w(d_i) = e^{-d_i} $$
(10)
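A compact sketch of Eqs. (8)–(10) in illustrative Python (note that Eq. (9) as stated uses unnormalized weights, which we follow literally here, although many KNN implementations divide by the weight sum):

```python
import math

def wknn_impute(x, neighbors, a, k):
    """Impute missing attribute a of record x from its k nearest complete records."""
    # Euclidean distance over all attributes except the missing one (Eq. 8)
    def dist(y):
        return math.sqrt(sum((x[j] - y[j]) ** 2
                             for j in range(len(x)) if j != a))

    nearest = sorted(neighbors, key=dist)[:k]
    # Weighted sum with w(d) = exp(-d), per Eqs. (9)-(10)
    return sum(math.exp(-dist(y)) * y[a] for y in nearest)
```

An exact match at distance 0 receives weight \(e^{0}=1\), so with \(k=1\) the neighbor's attribute value is copied directly.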
Fig. 2
figure 2

Illustration for the WKNN method where the white block represents the attribute with missing value

3.7 Extreme learning machine (ELM)

The extreme learning machine (ELM), proposed in [14, 15], is a machine learning method for Single Layer Feed Forward Neural Networks (SLFNs). It shows better generalization performance and a faster learning process compared with SVMs and traditional SLFNs trained by gradient-based algorithms. Therefore, the ELM has been applied in many fields, such as hydrology [6, 38, 45], pattern recognition [24], neuroscience [36], and consumer electronics [1].

Fig. 3
figure 3

General topology of the ELM model

Figure 3 illustrates the basic topological structure of an ELM network. Briefly, the basic theory of the ELM model states that for N arbitrary distinct input samples \(({\mathbf {x}}_{k},{\mathbf {y}}_{k})\in { R^{n}\times {R^{m}}}\), the standard SLFN with M hidden nodes and an activation function g(\(\cdot \)) can be mathematically described as Eq. (11)

$$\begin{aligned} \sum _{i=1}^{M}{{\varvec{\beta }}_{i}}g({\mathbf {x}}_{k};c_{i},{\mathbf {w}}_{i})={\mathbf {y}}_{k} \quad k=1,2, \ldots N \end{aligned}$$
(11)

where \(c_{i} \in {R}\) is the bias of the ith hidden node, \({\mathbf {w}}_{i}\in {R^{n}}\) is the input weight vector connecting the ith hidden node and the input nodes, \({\varvec{\beta }}_{i}\) is the weight vector connecting the ith hidden node to the output nodes, and \(g({\mathbf {x}}_{k};c_{i},{\mathbf {w}}_{i})\) is the output of the ith hidden node with respect to the input sample \({\mathbf {x}}_{k}\). In the ELM, the input weights and hidden biases are randomly generated. By doing so, the nonlinear system is converted into the following linear system:

$$\begin{aligned} {\mathbf {H}}\times {\varvec{\beta }}={\mathbf {Y}} \end{aligned}$$
(12)

where \({\mathbf {H}}\), \({\varvec{\beta }}\), and \({\mathbf {Y}}\) are expressed with Eqs. (13), (14), and (15) shown as follows, respectively.

$$\begin{aligned} {\mathbf {H}}= & {} \left( \begin{array}{ccc} g({\mathbf {x}}_1; c_1, {\mathbf {w}}_1) &{} \ldots &{} g({\mathbf {x}}_1; c_M, {\mathbf {w}}_M) \\ \vdots &{} \vdots &{} \vdots \\ g({\mathbf {x}}_N; c_1, {\mathbf {w}}_1) &{} \ldots &{} g({\mathbf {x}}_N; c_M, {\mathbf {w}}_M) \end{array} \right) _{N\times M} \end{aligned}$$
(13)
$$ {\varvec{\beta }}= (\varvec{\beta }_{1}^{\mathrm{T}}, {\varvec{\beta }}_{2}^{\mathrm{T}},\ldots {\varvec{\beta }}_{M}^{\mathrm{T}})^{\mathrm{T}}_{m\times M} $$
(14)
$$ {\mathbf {Y}}= ({\mathbf {y}}_{1}^{\mathrm{T}},{\mathbf {y}}_{2}^{\mathrm{T}},\ldots {\mathbf {y}}_{N}^{\mathrm{T}})_{m\times N}^{\mathrm{T}} $$
(15)

Thus, determining the output weights \({\varvec{\beta }}\) reduces to finding the minimum-norm least-squares (LS) solution of the linear system, given by Eq. (16). As analyzed in [14], by using such a Moore–Penrose (MP) inverse method, the ELM tends to obtain good generalization performance and dramatically increases the learning speed.

$$\begin{aligned} \hat{\varvec{\beta }}= {\mathbf {H}}^{\dag }{\mathbf {Y}} \end{aligned}$$
(16)

In this paper, we denote by \({\mathbf {Y}}\) the attributes with missing values and by \({\mathbf {X}}\) the other attributes. All complete records were used to train the ELM. \({\mathbf {w}}_{i}\) and \(c_{i}\) are randomly generated within \([-1,1]\). To ensure the statistical significance of the results, we repeat the training and predicting process 100 times and use the average predicted values to infill the missing values. It is worth noting that the number of hidden neurons has a great influence on the accuracy of the prediction. Through a large number of experiments, shown in Fig. 4, we found that the accuracy grows with the number of neurons, but beyond a certain point additional neurons do not improve the prediction accuracy significantly. Therefore, we empirically set an ELM topology with 60 neurons in the hidden layer.

Fig. 4
figure 4

Effect of the number of neurons on the accuracy
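A single training-and-prediction pass of the ELM can be sketched as follows (an illustrative NumPy implementation; the sigmoid activation, function name, and single-run interface are our assumptions, and the 100-run averaging described above would simply wrap this in a loop):

```python
import numpy as np

def elm_fit_predict(X_train, Y_train, X_test, M=60, seed=0):
    """Minimal ELM: random input weights and biases, sigmoid activation,
    output weights via the Moore-Penrose pseudoinverse (Eq. 16)."""
    rng = np.random.default_rng(seed)
    n = X_train.shape[1]
    W = rng.uniform(-1, 1, size=(M, n))   # input weights w_i, drawn from [-1, 1]
    c = rng.uniform(-1, 1, size=M)        # hidden biases c_i

    def hidden(X):
        # Hidden-layer output matrix H of Eq. (13)
        return 1.0 / (1.0 + np.exp(-(X @ W.T + c)))

    beta = np.linalg.pinv(hidden(X_train)) @ Y_train   # Eq. (16)
    return hidden(X_test) @ beta
```

Only `beta` is learned; the random hidden layer is never updated, which is exactly why training reduces to one pseudoinverse computation.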

3.8 Radial basis function networks (RBF)

Radial basis function networks are also a kind of SLFN, so RBF networks share the same topology as the ELM. Unlike other SLFNs, the RBF model sets the weights between the input layer and the hidden layer to one. In addition, for a given input \({\mathbf {x}}\), each hidden node of the RBF model employs a radial basis function to quantify the degree of activation; together, these choices significantly reduce the number of SLFN parameters and make the network easy to implement. The general topology of radial basis function neural networks (RBF NNs) is shown in Fig. 5, where \({\mathbf {y}}_{k}\) is a weighted linear combination of the activation degrees of the incoming input \({\mathbf {x}}_k\):

$$\begin{aligned} \sum _{i=1}^{M}{\varvec{\beta }}_{i}\varTheta _{i}({\mathbf {x}}_{k})={\mathbf {y}}_{k}\quad k=1,2,\ldots N \end{aligned}$$
(17)

In the case with the Gaussian type of RBFs, we have

$$\begin{aligned} \varTheta _{i}({\mathbf {x}}_{k})=\exp (-\sigma _{i}\Vert {\mathbf {x}}_{k}-{\varvec{\nu }}_{i}\Vert ^{2}) \end{aligned}$$
(18)

where \({\mathbf {x}}_{k}=[x_{1},\ldots x_{n}]^{\mathrm{T}}\) represents the n-dimensional input vector, and \({\varvec{\nu }}_{i}=[\nu _{1}, \ldots \nu _{n}]^{\mathrm{T}}\) and \(\sigma _{i}\) represent the center vector and the spread parameter of the ith hidden node, respectively. The notation \(\Vert \cdot \Vert \) in Eq. (18) denotes the Euclidean distance. There are many algorithms for training RBF models [18, 28, 34, 37, 39]. In this paper, we use the standard RBF training algorithm of the MATLAB neural network toolbox to infill the missing values in the soil moisture data.

Fig. 5
figure 5

General topology of the RBF model
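For illustration, the fragment below implements a simplified Gaussian RBF network per Eqs. (17)–(18), placing one center at every training point and solving the output weights by least squares; this is an expository assumption on our part, not the MATLAB toolbox algorithm actually used in the experiments:

```python
import numpy as np

def rbf_fit_predict(X_train, Y_train, X_test, sigma=1.0):
    """Simplified Gaussian RBF network: centers fixed at the training points,
    output weights beta solved by least squares."""
    def activations(X, centers):
        # Theta_i(x_k) = exp(-sigma * ||x_k - nu_i||^2), Eq. (18)
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-sigma * d2)

    Theta = activations(X_train, X_train)
    beta, *_ = np.linalg.lstsq(Theta, Y_train, rcond=None)
    return activations(X_test, X_train) @ beta   # Eq. (17)
```

With distinct training points the Gaussian activation matrix is nonsingular, so this simplified network interpolates the training data exactly; practical trainers such as the toolbox algorithm instead add neurons incrementally to avoid this kind of overfitting.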

4 Analysis

4.1 Soil dataset

The dataset used in this paper is collected by a soil-monitoring system deployed in the Jiufeng National Forestry Park, Beijing, China; the system, shown in Fig. 6, is located at 115.7\(^\circ \)E and 39.4\(^\circ \)N (marked with a red point). Beijing has a dry, monsoon-influenced humid continental climate, where the daily average temperature is only −3.7 \(^\circ \)C in January and the precipitation from June to August accounts for about three-fourths of the total yearly precipitation. In this system, there are three soil-monitoring stations about ten meters away from each other; they report their data to a data logger, which buffers the collected data in a local SD card. Every week, an operator manually retrieves the soil data file from the SD card. Each station is equipped with five soil sensing probes arranged at five top-to-bottom layers (depths): 2, 5, 10, 15, and 20 cm, respectively, and each probe simultaneously captures three attributes at an interval of 15 minutes: the soil moisture, the soil temperature, and the soil electrical conductivity. The logger also associates a timestamp with each record.

Fig. 6
figure 6

Deployment of the soil-monitoring site at Jiufeng National Forestry Park, Beijing, China

4.2 Setup

The monitoring system at our study site operated from October 2010 to October 2012. The whole two-year dataset contains a large amount of irregularly distributed data loss. We carefully verified that the set of records obtained from October 2010 to January 2011 involves only one missing soil moisture value; this set of records can therefore reasonably be regarded as a complete dataset. Specifically, we choose the data returned by the sensing probe at depth 5 cm of one station as the benchmark dataset to evaluate the eight infilling methods. The benchmark has 6060 records of three soil attributes (the soil moisture, the soil temperature, and the soil conductivity). Figure 7 shows the distribution of all the soil moisture values in the benchmark dataset.

Fig. 7
figure 7

Soil moisture data from the sensing probe of depth 5 cm at a station

To simulate the continuous missing characteristic of soil datasets returned by wireless sensing systems, we artificially specify various missing segments with different ratios. We first remove a missing segment from the benchmark dataset and then apply the eight imputation methods to infill the values in this segment. Table 1 gives the missing ratios used in this paper. The five missing ratios are determined by the inspection (physical visit) period in practice, which is usually half a day, one day, one week, or one month (four weeks). As is clear in Fig. 7, the moisture varies steadily before the 2500-th data point but drastically after the 3000-th data point. So, to evaluate the performance of the eight methods on steady and dynamic time-series data, we specify two starting points in the benchmark dataset: Start I, the 1001-th data point, and Start II, the 3001-th data point, as labeled in Fig. 7. According to Table 1, missing segments 1–5 all start from Start I and missing segments 6–10 all start from Start II.

$$\begin{aligned} \hbox {RMSE}=\sqrt{\frac{1}{n}\times {\sum _{i=1}^{n}(\hat{y}_{i}-y_{i})^2}} \end{aligned}$$
(19)

In this study, we use the root-mean-square error (RMSE), widely adopted in the community [8], to evaluate the eight methods of infilling missing soil moisture data. As shown in Eq. (19), the RMSE is the root of the average squared difference between the predicted value (\(\hat{y}_i\)) and the original one (\(y_i\)). In general, the smaller the RMSE derived by a method, the better the effectiveness of that method.
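Eq. (19) in code form (a trivial but handy helper; the function name is our own):

```python
import math

def rmse(predicted, observed):
    """Root-mean-square error, Eq. (19)."""
    n = len(observed)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / n)
```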

Table 1 Configuration of the missing records in evaluation

4.3 Results

This section compares the performance of the eight infilling methods under different missing scales and different fluctuations. Figure 8 plots the data points infilled by the eight methods against the real data points. Note that the observed soil moisture values are marked by black circles, and the predicted values of the eight methods are marked by the colors shown in the legend of Fig. 8a. For the shortest missing segment over the steady dataset (Fig. 8a), all these infilling methods except the SLRD work well; for the longest missing segment over the fluctuating dataset (Fig. 8j), the eight methods differ greatly in performance. From Fig. 8, it can be seen that as the missing ratio increases, the LI and the ARIMA both produce straight lines to fit the observed data, regardless of the beginning point of the missing segment (Start I or Start II), while the SLRD always performs poorly. However, the VMLR, the ELM, the WKNN, the RBF, and the HMLR can all predict the variation trend of the dataset, albeit with different accuracies. Furthermore, the fitting performance of the HMLR and the RBF both experience significant degradation in the case with the 44.36% missing ratio. Nevertheless, estimates from the VMLR, the ELM, and the WKNN are highly consistent with the trend of the observed soil moisture, especially those of the VMLR at larger missing ratios.

Fig. 8
figure 8

Comparisons of the eight methods with five different missing ratios. The legend for a works for all the other sub-figures. a 0.79% missing from the 1001-th data point. b 0.79% missing from the 3001-th data point. c 3.17% missing from the 1001-th data point. d 3.17% missing from the 3001-th data point. e 11.09% missing from the 1001-th data point. f 11.09% missing from the 3001-th data point. g 22.18% missing from the 1001-th data point. h 22.18% missing from the 3001-th data point. i 44.36% missing from the 1001-th data point. j 44.36% missing from the 3001-th data point

The eight methods are further compared in Fig. 9. Obviously, the imputation performance of all eight methods degrades to different degrees for the missing segments that start at Start II and involve drastically varying data points. It is worth mentioning that the imputation performance of the LI and the ARIMA becomes very poor when the missing segments are chosen from this region. The LI method uses only two reference points; therefore, it does not work well for fluctuating datasets, especially when the missing gap is large. The ARIMA uses only a segment of steady data before the missing values (Start II), which does not contain sufficient information (a large or periodic dataset is preferable for ARIMA) and consequently yields lower performance. The RBF suffers a significant degradation at the \(44.36\%\) missing ratio from Start II. The reason is that there are not sufficiently varied training samples to build the RBF model. Moreover, compared with the ELM, the RBF risks overfitting or underfitting as a result of the restrictions of the standard RBF training algorithm in the MATLAB neural network toolbox. Interestingly, the VMLR demonstrates the most steady and precise prediction as the dataset becomes unsteady and the missing ratio grows. Both the VMLR and the HMLR employ multiple linear regression to infill missing soil moisture values, but the VMLR is preferred to the HMLR, suggesting that for a given station, the different soil layers (depths) used by the VMLR profile the temporal correlation of soil moisture with higher accuracy; i.e., the data from vertically arranged layers at the same station are more closely correlated than the data from the same layer at different stations.

Fig. 9
figure 9

Comparisons of the eight methods with different missing ratios. a Missing segments starting from the 1001-th data point. b Missing segments starting from the 3001-th data point

A comprehensive numerical comparison in terms of RMSE is given in Table 2. For the missing segments beginning at Start I, on average, the ELM is the best predictor, followed by the RBF, the WKNN, and the VMLR, none of which differ significantly, and the worst is the SLRD. For the missing segments beginning at Start II, on average, the VMLR performs best, followed closely by the ELM, and the SLRD is still the worst. We can conclude that when infilling a dataset with the smallest missing ratio, the LI is recommended, considering that it is simple and has relatively high infilling accuracy. As the missing ratio of the dataset increases, the ELM and the RBF are suggested for infilling the missing values. However, it is noticeable that the VMLR, with an average RMSE of 0.217% over the missing segments beginning at Start II, outperforms the other methods and seems suitable for infilling unsteady datasets with the largest missing ratios. Based on the time-series dataset used in this paper, the evaluation results show that the VMLR, the ELM, the RBF, the WKNN, the LI, the ARIMA, and the HMLR are all preferable to the SLRD, which is commonly used by field experts.

Table 2 RMSEs of the eight methods with different missing ratios. Given a method, avg. I represents the average RMSE over segments 1–5, and avg. II the average RMSE over segments 6–10

5 Conclusions

Ecological time-series datasets collected by wireless sensing systems often experience continuous data losses, which pose new challenges for missing-data processing. Researchers currently have little knowledge about effective approaches to this issue. This paper has investigated seven typical methods used to infill missing data in a soil time-series dataset, together with the ELM, which had not previously been employed for this task. We find that, overall, the VMLR, the ELM, and the RBF achieve better accuracy in infilling continuous missing soil moisture data. In detail, for infilling short missing segments, the ELM and the RBF perform desirably. The reason is that the ELM and the RBF are Single Layer Feed Forward Neural Networks (SLFNs); both can approximate arbitrary functions, but the ELM shows better generalization. For infilling missing values in an unsteady soil dataset with larger continuous missing segments, the VMLR outperforms all the other methods, and the accuracy of the ELM is only slightly lower than that of the VMLR. Therefore, the ELM has promising potential for infilling missing values across different missing segments. For all the specified missing segments, the VMLR is almost always preferred to the HMLR, indicating that the data from different layers of a given station are more strongly correlated than the data from different stations at the same layer. Thus, taking into account the correlation among multiple factors will be a promising starting point for designing effective approaches to infilling missing values in wireless soil datasets.