
1 Introduction

Fault tolerance is an essential requirement for distributed data stream processing systems. Existing systems, such as Flink [2], process dynamically generated data in real time via pipelined processing: each task, distributed across multiple servers, receives inputs from the previous task and sends its results to the next. This processing model enables efficient data processing, but a failure in one task affects the entire system. Since many applications, such as environmental monitoring, require high availability and low latency, processing systems need a mechanism to continue processing even when individual servers fail.

Existing systems generally guarantee fault tolerance by switching to standby nodes and reprocessing lost data [2, 3, 6, 12]. When the manager detects a worker failure, a newly launched worker takes over the processing of the failed one. In addition, some systems can recover from failures without error by reprocessing only the lost data. This approach is suitable for cluster environments where the system can manage all resources centrally and delegate input data management to a message queue.

In edge computing environments, however, we need a different approach to guarantee fault tolerance. Edge computing is a processing model aimed at traffic reduction and load balancing [15]. Each edge performs simple data processing such as aggregation and filtering, as shown in Fig. 1. For example, consider environmental sensing applications in a smart city. Environmental sensors installed widely across various places send measurements wirelessly to the nearest edge. Edges aggregate the inputs at regular intervals and send the results to the server. When an edge fails, a distant edge cannot receive the measurements and therefore cannot take over the processing. That is, the conventional approach of switching to standby nodes and reprocessing lost data requires duplicating every edge. This duplication cost increases sharply as the system scales, unlike in the cloud environment, where all resources can be managed centrally.

Fig. 1. Overview of edge computing

To solve this problem, we focus on environmental monitoring applications and propose an approach that guarantees approximate fault tolerance without edge duplication. We deal with various failures by estimating missing data using the correlation between sensors rather than by reprocessing. This approach guarantees error bounds for a user-specified confidence threshold and allows the server to continue subsequent processing without waiting for the inputs from failed edges.

In this paper, we focus on aggregation queries and assume that some measurements are lost due to sensor/communication failures. Edges can handle these failures by estimating missing data from observed data and aggregating them approximately [8]. For example, if sensor D fails in Fig. 2, edge 2 estimates missing data from the measurements of sensor C and continues aggregation without waiting for the inputs from sensor D.

We extend this approach to the problem of edge failures. That is, the server estimates the outputs of failed edges from the others to continue processing. In Fig. 2, aggregation results for sensors A and B are lost due to the failure of edge 1. The server estimates them from the results for sensors C and D obtained from edge 2 and continues the subsequent processing without waiting for the recovery of edge 1.

In our approach, because each edge aggregates approximately, the server must account for the uncertainty of the aggregation results to predict the error bounds accurately. There are various techniques for estimating missing values, such as mean imputation [10] and Gaussian process regression [14]. However, these techniques cannot account for the uncertainty of the estimation at each edge, which undermines reliability. In contrast, our method handles the uncertainty as a probability distribution and guarantees the error bounds appropriately.

Fig. 2. Overview of the proposed approach

The rest of this paper is organized as follows. In Sect. 2, we discuss related work on fault tolerance in data stream processing systems. In Sect. 3, we explain the approximate aggregation at edges, which is the basis of our approach. In Sect. 4, we propose an approach to support approximate fault tolerance and provide a method to estimate missing aggregation results while accounting for their uncertainty. We evaluate the validity of our approach through experiments in Sect. 5 and conclude the paper in Sect. 6.

2 Related Work

Many systems guarantee at-most-once or at-least-once semantics [3, 13]. A system that guarantees at-most-once semantics merely switches processing to standby machines to recover from failures, which can result in data loss. On the other hand, a system that guarantees at-least-once semantics reprocesses the lost data to recover from failures, but the same data can be processed multiple times. In either case, these systems cannot bound the errors caused by failures.

Some stream processing systems, such as Flink [2] and Spark Streaming [6], guarantee exactly-once semantics. These systems recover from failure without error by restoring the last checkpoint of the internal states and reprocessing only lost data. This approach is suitable for cluster environments where the system can manage all resources centrally and assign the input data management task to the message queue.

For edge computing, however, the conventional approach requires duplicating every edge. This duplication cost increases sharply as the system scales, unlike in the cloud environment, where all resources can be managed centrally. Moreover, reprocessing lost data causes delay, because new inputs keep arriving during the reprocessing. We can suppress this delay by parallelizing failure recovery [16], but that approach also requires duplication.

AF-Stream supports approximate fault tolerance to address the trade-off between performance and accuracy [11]. It mitigates the overhead by checkpointing adaptively while guaranteeing that the error upon failure stays within a user-specified threshold. However, this system also requires duplication and is not suitable for edge computing environments.

3 Preliminaries

In this section, we explain the idea of approximate aggregation at edges [8], which is the basis of our approach.

In this paper, we assume that all n sensors \(\boldsymbol{X} = \{ X_1, X_2, \dots , X_n \}\) send measurements periodically and synchronously to the nearby edge. Let \(\boldsymbol{x}^t = \langle x_1^t, x_2^t, \dots , x_n^t \rangle \) be the true values of \(\boldsymbol{X}\) at time \(t \in \mathbb {N}^{+}\) and let \(\boldsymbol{X}_i \subseteq \boldsymbol{X}\) be the sensor set assigned to edge i. If sensor or communication failures occur, edges cannot receive measurements from some sensors. In other words, at time t, edge i obtains only the measurements \(\boldsymbol{o}_i^t\) from sensors \(\boldsymbol{O}_i^t \subseteq \boldsymbol{X}_i\). We assume that each sensor is assigned to exactly one edge and that observed values are always equal to the true ones.

Edges divide the series of observations into time windows and aggregate each window approximately. In this paper, we deal with two aggregation queries: sum and average. Let \(t'\) be the start point of a time window and let w be the window width. Edge i calculates the aggregation result \(Y_X\) for each sensor \(X \in \boldsymbol{X}_i\) from the series of observations \(\boldsymbol{o}_i^{\left[ t', t'+w\right) } = \langle \boldsymbol{o}_i^{t'}, \boldsymbol{o}_i^{t'+1}, \dots , \boldsymbol{o}_i^{t'+w-1} \rangle \). However, what we really want is the aggregation result for \(\boldsymbol{x}_i^{\left[ t', t'+w\right) }\), not for \(\boldsymbol{o}_i^{\left[ t', t'+w\right) }\). That is, simply aggregating \(\boldsymbol{o}_i^{\left[ t', t'+w\right) }\) causes errors due to missing data and offers no theoretical guarantee on the error bounds.

To solve this problem, we estimate missing data from observations. We utilize the correlation between sensors \(\boldsymbol{X}\), modeled by a multivariate Gaussian distribution (hereafter, simply Gaussian) \(\mathcal {N}(\boldsymbol{\mu }, \Sigma )\), where \(\boldsymbol{\mu }\) and \(\Sigma \) denote the mean vector and the covariance matrix of \(\boldsymbol{X}\), respectively. For the measurements \(\boldsymbol{o}_i^t\) of sensors \(\boldsymbol{O}_i^t\), edge i calculates the posterior distribution \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{X}_i^t \mid \boldsymbol{o}_i^t}, \Sigma _{\boldsymbol{X}_i^t \mid \boldsymbol{o}_i^t})\) for the unobserved sensors \(\boldsymbol{X}_i^t = \boldsymbol{X}_i \setminus \boldsymbol{O}_i^t\) [7]:

$$\begin{aligned}&\boldsymbol{\mu }_{\boldsymbol{X}_i^t \mid \boldsymbol{o}_i^t} = \boldsymbol{\mu }_{\boldsymbol{X}_i^t} + \Sigma _{\boldsymbol{X}_i^t \boldsymbol{O}_i^t}\Sigma _{\boldsymbol{O}_i^t \boldsymbol{O}_i^t}^{-1}\left( \boldsymbol{o}_i^t - \boldsymbol{\mu }_{\boldsymbol{O}_i^t}\right) \mathrm {\ and} \end{aligned}$$
(1)
$$\begin{aligned}&\Sigma _{\boldsymbol{X}_i^t \mid \boldsymbol{o}_i^t} = \Sigma _{\boldsymbol{X}_i^t \boldsymbol{X}_i^t} - \Sigma _{\boldsymbol{X}_i^t \boldsymbol{O}_i^t}\Sigma _{\boldsymbol{O}_i^t \boldsymbol{O}_i^t}^{-1}\Sigma _{\boldsymbol{O}_i^t \boldsymbol{X}_i^t}. \end{aligned}$$
(2)

The subscripts of each symbol indicate the corresponding rows/columns in \(\boldsymbol{\mu }\) and \(\Sigma \). For instance, \(\Sigma _{\boldsymbol{X}_i^t \boldsymbol{O}_i^t}\) is the submatrix of \(\Sigma \) formed by the rows of \(\boldsymbol{X}_i^t\) and the columns of \(\boldsymbol{O}_i^t\), which represents the covariance between \(\boldsymbol{X}_i^t\) and \(\boldsymbol{O}_i^t\).
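As a concrete illustration, the following is a minimal NumPy sketch of the conditioning step in Eqs. (1) and (2). The names (condition_gaussian, miss, obs) are ours, not from the paper; miss and obs are integer index lists selecting the rows/columns of \(\boldsymbol{X}_i^t\) and \(\boldsymbol{O}_i^t\) in \(\boldsymbol{\mu }\) and \(\Sigma \).

```python
import numpy as np

def condition_gaussian(mu, Sigma, miss, obs, o):
    """Posterior N(mu_post, Sigma_post) of the missing sensors X_i^t
    given the measurements o of the observed sensors O_i^t."""
    S_oo = Sigma[np.ix_(obs, obs)]
    S_mo = Sigma[np.ix_(miss, obs)]
    # S_mo @ inv(S_oo), computed via solve() for numerical stability
    K = np.linalg.solve(S_oo, S_mo.T).T
    mu_post = mu[miss] + K @ (o - mu[obs])                 # Eq. (1)
    Sigma_post = Sigma[np.ix_(miss, miss)] - K @ S_mo.T    # Eq. (2)
    return mu_post, Sigma_post
```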

Finally, we aggregate the estimated distributions within the window. In the following, we focus on the average query, but the sum query can be handled in the same way. Since the Gaussian is a stable distribution, and assuming no temporal correlation, the average \(\boldsymbol{Y}_{\boldsymbol{X}_i}\) of sensors \(\boldsymbol{X}_i\) follows a Gaussian \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}_i}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}_i}})\) represented as follows [8]:

$$\begin{aligned}&\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}_i}} = \frac{1}{w} \left( \sum _{t \in [t', t'+w)} \boldsymbol{\mu }_{\boldsymbol{X}_i \mid \boldsymbol{o}_i^{t}} \right) \mathrm {\ and} \end{aligned}$$
(3)
$$\begin{aligned}&\Sigma _{\boldsymbol{Y}_{\boldsymbol{X}_i}} = \frac{1}{w^2} \left( \sum _{t \in [t', t'+w)} \Sigma _{\boldsymbol{X}_i \mid \boldsymbol{o}_i^{t}} \right) . \end{aligned}$$
(4)

Note that, for the observed sensors \(\boldsymbol{O}_i^t\), the mean vector consists of their measurements \(\boldsymbol{o}_i^t\) and all elements of the covariance matrix are 0.
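A short sketch of Eqs. (3) and (4) under the same conventions (function and variable names are illustrative):

```python
def aggregate_window(posteriors, w):
    """Average query over one window (Eqs. 3-4). `posteriors` is a list
    of w (mu, Sigma) pairs of NumPy arrays, one per time step; fully
    observed sensors carry their measurement as the mean and zero
    rows/columns in Sigma."""
    mu_Y = sum(mu for mu, _ in posteriors) / w         # Eq. (3)
    Sigma_Y = sum(S for _, S in posteriors) / (w * w)  # Eq. (4)
    return mu_Y, Sigma_Y
```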

This method theoretically guarantees the error bounds. Let \(f(\boldsymbol{x} \mid \boldsymbol{\mu }, \Sigma )\) be the probability density function of \(\mathcal {N}(\boldsymbol{\mu }, \Sigma )\). For an error bound e, we can calculate the probability that the average \(Y_{X}\) of sensor \(X \in \boldsymbol{X}_i\) is within e of the mean \(\mu _{Y_X}\):

$$\begin{aligned}&P\left( Y_{X} \in [\mu _{Y_{X}} - e, \mu _{Y_{X}} + e]\right) = \int _{\mu _{Y_{X}} - e}^{\mu _{Y_{X}} + e} f\left( y \mid \mu _{Y_{X}}, \Sigma _{Y_{X}}\right) dy. \end{aligned}$$
(5)

That is, for a user-specified confidence threshold \(\delta \), we can calculate the minimum error bound \(e'\) that satisfies \(\delta \) by solving:

$$\begin{aligned} P(Y_{X} \in [\mu _{Y_{X}} - e', \mu _{Y_{X}} + e']) = \delta . \end{aligned}$$
(6)

This guarantees that the true value of \(Y_{X}\) is within \(\mu _{Y_{X}} \pm e'\) with probability \(\delta \). Note that \(\Sigma _{Y_{X}}=0\) means that all data of X are observed in the time window and the aggregation result has no error (i.e., \(e'=0\)).
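For a univariate Gaussian, Eq. (6) has a closed-form solution via the standard normal quantile function, \(e' = \sqrt{\Sigma _{Y_{X}}}\,\Phi ^{-1}((1+\delta )/2)\). A minimal sketch (names are ours):

```python
import numpy as np
from scipy.stats import norm

def error_bound(var_Y, delta):
    """Minimum e' such that P(Y_X in [mu - e', mu + e']) = delta
    for Y_X ~ N(mu, var_Y), i.e. the solution of Eq. (6)."""
    if var_Y == 0.0:
        return 0.0  # all data observed in the window: exact result
    return np.sqrt(var_Y) * norm.ppf((1.0 + delta) / 2.0)

# e.g., error_bound(0.25, 0.9) == 0.5 * 1.645... ~= 0.822
```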

4 Proposed Approach

In this section, we propose our approach to support low-latency, high-availability fault tolerance without edge duplication. First, we explain how our approach guarantees approximate fault tolerance. Next, we provide a method to estimate the outputs of failed edges while accounting for the uncertainty of the processing results.

4.1 Overview

We assume that we have m edges that send the aggregation results to the server periodically and synchronously. Note that edge i sends a Gaussian \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}_i}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}_i}})\) as the aggregation result. For example, in Fig. 2, edge 2 calculates a bivariate Gaussian as the aggregation result for sensors C and D and sends it to the server. Then the server proceeds with the subsequent processing as soon as the processing results of all edges are available.

If a non-duplicated edge fails, the server cannot obtain the processing results during that time. In Fig. 2, due to the failure of edge 1, the server cannot obtain the aggregation result for sensors A and B and cannot proceed with the subsequent processing.

We solve this problem by estimating missing results using the correlation between sensors. Let \(\boldsymbol{X}' \subseteq \boldsymbol{X}\) be the sensors assigned to failed edges (hereafter, referred to as missing sensors) and let \(\boldsymbol{O} = \boldsymbol{X} \setminus \boldsymbol{X}'\) be the sensors assigned to remaining edges (hereafter, referred to as observed sensors). The server obtains a distribution \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y_O}}, \Sigma _{\boldsymbol{Y_O}})\) as the aggregation results \(\boldsymbol{Y_O}\) for observed sensors and estimates a distribution \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}'}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}'}})\) as the aggregation results \(\boldsymbol{Y}_{\boldsymbol{X}'}\) for the missing sensors. Then the server continues the subsequent processing without waiting for the recovery of failed edges.

For example, in Fig. 2, the missing sensors are \(\boldsymbol{X}'=\boldsymbol{X}_1=\{X_A, X_B\}\) and the observed sensors are \(\boldsymbol{O}=\boldsymbol{X}_2=\{X_C, X_D\}\). The server can filter out the sensors \(X\in \boldsymbol{X}'\) whose aggregation result \(Y_X\) is less than a threshold \(\theta \) in two steps: first, the server calculates the probability \(P(Y_X < \theta )\) from the Gaussian \(\mathcal {N}(\mu _{Y_X}, \Sigma _{Y_X})\); then it checks whether this probability is greater than or equal to \(\delta \).
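A sketch of this two-step filter for a univariate marginal (the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm

def below_threshold(mu_Y, var_Y, theta, delta):
    """Return True iff P(Y_X < theta) >= delta for Y_X ~ N(mu_Y, var_Y)."""
    if var_Y == 0.0:              # exact aggregation result
        return mu_Y < theta
    return norm.cdf(theta, loc=mu_Y, scale=np.sqrt(var_Y)) >= delta
```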

In our approach, it is essential to account for the uncertainty of the processing results in order to estimate the outputs of failed edges accurately. For example, when estimating the output \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}_1}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}_1}})\) of edge 1 from the output \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}_2}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}_2}})\) of edge 2 in Fig. 2, one could compute the posterior distribution given the mean vector \(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}_2}}\) as described in Sect. 3. However, this method ignores the covariance matrix \(\Sigma _{\boldsymbol{Y}_{\boldsymbol{X}_2}}\), which represents the uncertainty of the output of edge 2, and therefore underestimates the error bounds.

4.2 Aggregation Result Estimation

If we can model the correlation between sensors \(\boldsymbol{X}\) by a Gaussian \(\mathcal {N}(\boldsymbol{\mu }, \Sigma )\), we can also model the correlation between the aggregation results \(\boldsymbol{Y}_{\boldsymbol{X}}\) by a Gaussian. Consider an average query over a time window of width w. By the reproductive property of the Gaussian distribution, the correlation between \(\boldsymbol{Y}_{\boldsymbol{X}}\) is modeled by a Gaussian \(\mathcal {N}(\boldsymbol{\mu }', \Sigma ')\) whose mean vector and covariance matrix are:

$$\begin{aligned}&\boldsymbol{\mu }' = \frac{1}{w} \left( \sum _{t \in [t', t'+w)} \boldsymbol{\mu }^t \right) = \boldsymbol{\mu } \mathrm {\ and} \end{aligned}$$
(7)
$$\begin{aligned}&\Sigma ' = \frac{1}{w^2} \left( \sum _{t \in [t', t'+w)} \Sigma ^t \right) = \frac{1}{w} \Sigma . \end{aligned}$$
(8)

If the server knows that the aggregation results \(\boldsymbol{Y_O}\) are equal to \(\boldsymbol{y}\), the server can calculate the posterior probability distribution \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{y}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{y}})\) as the estimated aggregation results \(\boldsymbol{Y}_{\boldsymbol{X}'}\):

$$\begin{aligned}&\boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{y}} = \boldsymbol{\mu }'_{\boldsymbol{X}'} + \Sigma '_{\boldsymbol{X}' \boldsymbol{O}}\left( \Sigma '_{\boldsymbol{O} \boldsymbol{O}}\right) ^{-1}\left( \boldsymbol{y} - \boldsymbol{\mu }'_{\boldsymbol{O}}\right) \mathrm {\ and} \end{aligned}$$
(9)
$$\begin{aligned}&\Sigma _{\boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{y}} = \Sigma '_{\boldsymbol{X}' \boldsymbol{X}'} - \Sigma '_{\boldsymbol{X}' \boldsymbol{O}}\left( \Sigma '_{\boldsymbol{O} \boldsymbol{O}}\right) ^{-1}\Sigma '_{\boldsymbol{O} \boldsymbol{X}'}. \end{aligned}$$
(10)

However, the server cannot know the exact value \(\boldsymbol{y}\) because edges aggregate sensor data approximately.

In our method, the server estimates the aggregation results \(\boldsymbol{Y}_{\boldsymbol{X}'}\) using the Gaussian \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y_O}}, \Sigma _{\boldsymbol{Y_O}})\) that represents the uncertainty of \(\boldsymbol{Y_O}\). That is, the server weights each possible value \(\boldsymbol{y}\) of \(\boldsymbol{Y_O}\) by its density under \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y_O}}, \Sigma _{\boldsymbol{Y_O}})\) and marginalizes \(\boldsymbol{y}\) out:

$$\begin{aligned} P\left( \boldsymbol{Y}_{\boldsymbol{X}'} \right) =&\int P\left( \boldsymbol{Y_O}=\boldsymbol{y} \right) P\left( \boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{Y_O} = \boldsymbol{y} \right) d\boldsymbol{y} \end{aligned}$$
(11)
$$\begin{aligned} =&\int f\left( \boldsymbol{y} \mid \boldsymbol{\mu }_{\boldsymbol{Y_O}}, \Sigma _{\boldsymbol{Y_O}} \right) f\left( \boldsymbol{y}' \mid \boldsymbol{\mu }_{\boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{y}}, \Sigma _{\boldsymbol{Y}_{\boldsymbol{X}'} \mid \boldsymbol{y}} \right) d\boldsymbol{y}. \end{aligned}$$
(12)

Note that the sensors \(\boldsymbol{O}\) may be spread across several edges, each aggregating its own subset separately, so the distribution \(\mathcal {N}(\boldsymbol{\mu }_{\boldsymbol{Y_O}}, \Sigma _{\boldsymbol{Y_O}})\) of \(\boldsymbol{Y_O}\) is a combination of their outputs. In this paper, since edge i cannot obtain the observations of sensors \(\boldsymbol{X}_j\) assigned to another edge j, we set all covariances between \(\boldsymbol{Y}_{\boldsymbol{X}_i}\) and \(\boldsymbol{Y}_{\boldsymbol{X}_j}\) in \(\Sigma _{\boldsymbol{Y_O}}\) to 0.

Evaluating Eq. (12), we obtain a Gaussian \(\mathcal {N}(\boldsymbol{\mu }'_{\boldsymbol{Y}_{\boldsymbol{X}'}}, \Sigma '_{\boldsymbol{Y}_{\boldsymbol{X}'}})\) whose mean vector and covariance matrix are:

$$\begin{aligned} \boldsymbol{\mu }'_{\boldsymbol{Y}_{\boldsymbol{X}'}} =&\boldsymbol{\mu }'_{\boldsymbol{X}'} + \Sigma '_{\boldsymbol{X}' \boldsymbol{O}}\left( \Sigma '_{\boldsymbol{O} \boldsymbol{O}}\right) ^{-1}\left( \boldsymbol{\mu }_{\boldsymbol{Y_O}} - \boldsymbol{\mu }'_{\boldsymbol{O}}\right) \mathrm {\ and} \end{aligned}$$
(13)
$$\begin{aligned} \Sigma '_{\boldsymbol{Y}_{\boldsymbol{X}'}} =&\Sigma '_{\boldsymbol{X}' \boldsymbol{X}'} - \Sigma '_{\boldsymbol{X}' \boldsymbol{O}}\left( \Sigma '_{\boldsymbol{O} \boldsymbol{O}}\right) ^{-1}\Sigma '_{\boldsymbol{O} \boldsymbol{X}'} +\Sigma '_{\boldsymbol{X}'\boldsymbol{O}}\left( \Sigma '_{\boldsymbol{O} \boldsymbol{O}}\right) ^{-1}\Sigma _{\boldsymbol{Y_O}}\left( \Sigma '_{\boldsymbol{O} \boldsymbol{O}}\right) ^{-1}\Sigma '_{\boldsymbol{O}\boldsymbol{X}'}. \end{aligned}$$
(14)

Note that this is similar to the posterior distribution given the mean vector \(\boldsymbol{\mu }_{\boldsymbol{Y_O}}\) as the aggregation results: the mean vector is exactly Eq. (13), and the covariance matrix equals the first two terms of Eq. (14). The difference is the third term of Eq. (14), a correction term for the covariance matrix based on the uncertainty of the processing results. That is, unlike simply estimating \(\boldsymbol{Y}_{\boldsymbol{X}'}\) from the mean vector \(\boldsymbol{\mu }_{\boldsymbol{Y_O}}\), our method corrects the covariance matrix \(\Sigma '_{\boldsymbol{Y}_{\boldsymbol{X}'}}\) to account for the uncertainty and thereby predicts the error bounds accurately.
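To make the computation concrete, here is a minimal NumPy sketch of Eqs. (13) and (14). The names are ours: mu and Sigma are the trained model over all sensors, miss/obs are index lists for \(\boldsymbol{X}'\) and \(\boldsymbol{O}\), and mu_YO/Sigma_YO is the combined Gaussian reported by the remaining edges, with cross-edge covariances set to 0 as noted above.

```python
import numpy as np

def estimate_failed_outputs(mu, Sigma, w, miss, obs, mu_YO, Sigma_YO):
    """Estimate the Gaussian of the failed edges' results (Eqs. 13-14)."""
    mu_p, Sigma_p = mu, Sigma / w        # window-level prior, Eqs. (7)-(8)
    S_oo = Sigma_p[np.ix_(obs, obs)]
    S_mo = Sigma_p[np.ix_(miss, obs)]
    K = np.linalg.solve(S_oo, S_mo.T).T  # Sigma'_{X'O} (Sigma'_{OO})^{-1}
    mu_est = mu_p[miss] + K @ (mu_YO - mu_p[obs])          # Eq. (13)
    Sigma_est = (Sigma_p[np.ix_(miss, miss)] - K @ S_mo.T  # posterior terms
                 + K @ Sigma_YO @ K.T)                     # correction term
    return mu_est, Sigma_est
```

In this sketch, the baseline of Sect. 5 corresponds to dropping the final K @ Sigma_YO @ K.T term, which is exactly the correction for the uncertainty of the observed results.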

5 Experimental Evaluation

In this section, we demonstrate the validity of the proposed method through experiments. We explain the experimental settings and then describe the results in detail. Note that we omit the results for the sum query because they show the same tendency as the average query.

5.1 Experimental Settings

We use the following two datasets to evaluate the validity of our method.

  • Real. We measure the temperature in our laboratory with 24 2JCIE-BL environmental sensors [1]. This dataset contains tuples for 24 days at 1-minute intervals. We use the data for the first 16 days for model training and the rest for evaluation. Note that a pre-test using mvnormtest [4] revealed that these data do not follow a multivariate Gaussian.

  • Synthetic. We prepare synthetic data to verify the validity of the theoretical error guarantee of our method. The dataset has 11520 tuples, corresponding to 8 days, which follow the Gaussian obtained from the training data. We use the rmvnorm function [5], which generates n-dimensional random vectors following a given n-dimensional Gaussian. Note that the tuples have no temporal correlation because we generate them independently.

In this experiment, we assume a simple failure scenario and prepare test data in three steps, as sketched below: first, we randomly delete part of the measurements at drop rate r, representing data lost through temporary communication failures between sensors and edges. Next, we split the partially deleted data by the window width w and calculate the multivariate Gaussian of the aggregation results as described in Sect. 3. Finally, we randomly select \(m'\) failed edges and delete their aggregation results, representing data lost through edge failures.
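A hedged sketch of this procedure; data is a (T, n) measurement array, and all names are illustrative. The per-window Gaussian aggregation itself reuses the method of Sect. 3.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_test_data(data, r, w, num_edges, num_failed):
    """Steps 1-3 of the test-data preparation (aggregation omitted)."""
    T, n = data.shape
    dropped = data.astype(float).copy()
    dropped[rng.random((T, n)) < r] = np.nan                      # step 1
    windows = [dropped[s:s + w] for s in range(0, T - w + 1, w)]  # step 2
    failed = rng.choice(num_edges, size=num_failed, replace=False)  # step 3
    return windows, failed
```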

5.2 Methodology

We evaluate the validity of our approach from the perspective of the theoretical error guarantee. In our approach, the server estimates the missing aggregation result \(Y_X\) for sensor X as a Gaussian \(\mathcal {N}(\mu _{Y_X}, \Sigma _{Y_X})\) and guarantees its error bound \(e'\) for a user-specified confidence threshold \(\delta \). That is, the true value of \(Y_X\) theoretically lies within \(\mu _{Y_X}\pm e'\) with probability \(\delta \). Therefore, we take the aggregation result for the lossless dataset as the true value and define the reliability as the probability that the true value lies within \(\mu _{Y_X}\pm e'\). When the reliability satisfies \(\delta \), we consider the error guarantee of our approach valid.
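In code, the reliability metric reduces to a hit rate over all estimated results (a sketch; array names are ours):

```python
import numpy as np

def reliability(mu_est, e_bounds, true_vals):
    """Fraction of estimates whose true (lossless) aggregate lies
    within mu_est +/- e'."""
    mu_est, e_bounds, true_vals = map(np.asarray,
                                      (mu_est, e_bounds, true_vals))
    return float(np.mean(np.abs(true_vals - mu_est) <= e_bounds))
```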

We evaluate the reliability while varying each parameter: the confidence threshold \(\delta \), the window width w, the drop rate r, and the number of failed edges \(m'\). Their default values are \(\delta =0.9\), \(w=10\), \(r=0.3\), and \(m'= 1\), respectively. We compare our method with a method that estimates missing results without accounting for the uncertainty (i.e., calculating Eq. (9) and Eq. (10) with \(\boldsymbol{y}=\boldsymbol{\mu }_{\boldsymbol{Y_O}}\)). We refer to this method as the baseline.

5.3 Results

We show the experimental results for the synthetic data in Fig. 3. The dotted lines in the figures show the confidence threshold \(\delta \). The reliability of the baseline is always below \(\delta \), whereas our method satisfies \(\delta \) for all parameter settings. These graphs show that the uncertainty of aggregation results affects the estimation of error bounds and that our method can reliably recover missing aggregation results from partial observations.

Fig. 3. Experimental results for synthetic data

Next, consider how the reliability changes with each parameter. The difference in reliability between the baseline and our method increases as the drop rate r increases: more missing tuples mean greater uncertainty in the processing results of each edge, which leaves the baseline lacking reliability. The difference also increases as the number of failed edges \(m'\) decreases, since fewer failed edges mean more observed processing results carrying uncertainty. We find that our method is particularly effective in these scenarios, where the impact of the uncertainty increases.

On the other hand, as shown in Fig. 4, our method does not satisfy the confidence threshold for the real data. We found that this is due to temperature changes over time. To solve this problem, we have to extend our method to handle temporal correlations; we are currently considering an approach that updates the model with a Kalman filter [9]. We should also pay attention to differences in daily trends caused by human activity. For example, the temperature is held steady by the air conditioner when someone is in the room, whereas it decreases gradually when the room is vacant. We can address this by simply learning a separate model for each situation.

Fig. 4. Experimental results for real data

6 Conclusions

We proposed an approach to guarantee approximate fault tolerance in edge computing environments. In our approach, the server estimates missing aggregation results using the correlation between sensors to continue processing without waiting for the recovery of failed edges. We also provided a method to estimate missing aggregation results while accounting for the uncertainty of the observed processing results. The experimental results demonstrated the importance of handling uncertainty and the validity of our method. However, our method could not estimate error bounds accurately for real data. To solve this problem, we are currently considering extending our method to handle temporal correlations.