1 Introduction

In real-life scenarios, wireless sensor networks in Internet of Things (IoT) environments are widely used for contextual information monitoring and on-line large-scale predictive analytics, including environmental (e.g., forest and marine) monitoring and smart city applications. IoT predictive intelligence applications process contextual information captured from a number of dedicated stationary and/or mobile sensor nodes (sources of contextual information) with advanced sensing and computing capabilities. Sources sense and monitor, e.g., physical contextual parameters (context) and transmit the collected pieces of context to a central predictive analytics and information processing system (hereinafter referred to as System) using wireless communication technologies, e.g., multi-hop communication. However, the sensory field of the sources, e.g., IoT wireless devices within a city area, has a number of inherent characteristics, including uncontrollable environments and topological constraints. Sources are typically powered by batteries and thus have limited energy resources. Moreover, environmental monitoring, IoT smart applications, and on-line statistical analytics applications require efficient, accurate and timely data analysis in order to facilitate (near) real-time critical decision-making and situation- and context-awareness.

Accurate predictive analytics relies on the quality of context and the quality of context inference expressed by meta-information [1], e.g., contextual value validity thresholds, outliers, expiration thresholds, and contextual information with enhanced semantics. Raw contextual observations collected from sources, however, may have low quality and reliability due to limited energy and computational resources and harsh deployment environments. Predictive analytics tasks like outlier detection, multivariate regression and classification, information fusion (e.g., aggregation), and situational context inference and reasoning need sensed context of high quality. Inaccurate observations resulting from source malfunction need to be corrected or removed [8]. Left uncorrected, they bias the extracted knowledge and the analytics tasks, e.g., false alarms for fire detection, high prediction error in regression models, incompatible context inference, high misclassification errors, and inconsistent reasoning. Machine and Statistical Learning (MSL) methods are adopted for (i) identifying and (ii) (ideally) correcting problematic context (e.g., missing values, obsolete data, and outliers). Such MSL methods are of high importance for knowledge extraction, inference, and decision making over incomplete underlying data [6]. Most MSL techniques, such as neural networks and support vector machines, fail if one or more inputs contain missing values and thus cannot be used for predictive analytics and decision-making purposes [7].

In the state of the art, it is possible to find quite a few IoT monitoring and predictive analytics solutions such as forest monitoring [2], fire-event prediction and classification [3], agriculture monitoring [4], marine environment states prediction [5], watershed prediction systems [20], health states prediction in rivers [21], or energy management solutions to reduce both the amount of resources needed and the atmospheric emissions [22]. The reader could also refer to the survey [23] and the references therein. Sensor networks, as the pillars of the contextual information sources, promise to revolutionize sensing in a wide range of intelligent application domains because of their reliability, accuracy, flexibility, cost effectiveness [24] and ease of deployment. However, contextual data streams pose a challenge to large-scale predictive analytics because traditional approaches to quality control cannot efficiently (i) handle large-scale observations and (ii) deal with the demands of real-time processing. There is an increasing need for predictive intelligence methods to check and correct (sensed) context to ensure that it is delivered in near real time and is of the highest quality. Time-optimized context quality control expedites post-processing and analytics (e.g., missing values substitutions, concept drift correction) so that the final delivered context is of high quality for further processing by regression/classification tasks. This motivated us to introduce an optimally scheduled context quality aware mechanism which improves the quality of the context delivered to the System for near real time predictive analytics and knowledge extraction. The proposed mechanism materializes quality assessment prior to delivery of the context to the System by minimizing the induced bias in statistical inference and/or estimation processes due to problematic sensed context. 
As will be shown in the experimental evaluation section, our mechanism delivers contextual information of high quality to the System (e.g., as much non-problematic and accurate data as possible), inducing a relatively small delay compared to solutions that either immediately deliver context or decide on context delivery upon threshold-based rules that do not take into account the quality dynamics of the contextual data.

2 Rationale

The rationale behind the proposed mechanism is to deliver high quality context to the predictive analytics System through a stochastically, optimally controlled (delivery) delay. Within such delay-tolerant delivery, the mechanism optimally decides when to deliver context with the highest possible quality, thus improving predictive analytics tasks. The mechanism delivers context (represented by a row vector) x = [x 1,…,x n ] of n measurements (values), where each x i corresponds to the i-th source, with the fewest possible problematic pieces of data. We require that the System receive good context x in the sense that it consists of as many non-problematic values as possible. This is mandatory since the quality of x affects the predictive analytics tasks for monitoring the state of nature in the receptive field and/or MSL methods for knowledge extraction. We abstract such methods/tasks through a function f(x) over sensed context x, which formulates an MSL/predictive analytics process. For instance, f(x) refers to a statistical metric like the mean value, or to a multivariate regression model, e.g., the linear regression model f(x) = w x + b, b > 0, with x being the predictor vector and w the learned parameter, or to a classification model, e.g., f(x) = sign(w x + b). Inevitably, the more non-problematic values the System receives, the more accurate an analytics process in terms of f(x) can be achieved. Our mechanism attempts to deliver as good context as possible to achieve a high quality of the invoked analytics process. In that sense, we delay the delivery of context to the System in the hope of observing a relatively good one to deliver, at the expense of a certain delay. Figure 1 shows the rationale of the proposed mechanism. The baseline solution is to immediately deliver the current context x to the System without taking into account the context quality semantics.

Fig. 1 Overall idea: context x from the sensory field is fed to the optimally quality-driven scheduled mechanism, which either delivers x to the System or waits for possibly high quality context

2.1 Motivation

We report on four real-life cases / scenarios in order to further exemplify our motivation on the application of quality-optimized predictive analytics.

Case 1 [Incomplete Contextual Data]

If, at a given time instance, a portion of the values received by the System are problematic, say x 1,…,x m with m < n missing values, then there might be a bias in further processing of x. For instance, consider the deviation in estimating the mean value over the n − m observed values, i.e., \(f(\mathbf {x}) = \frac {1}{n-m}{\sum }_{i=1}^{n-m}x_{i}\), instead of over all n values, \(\frac {1}{n}{\sum }_{i=1}^{n}x_{i}\), or in estimating the order statistics, e.g., f(x) = min(x 1,…,x n−m ) instead of min(x 1,…,x n ); recall the ‘effect size’ problem in statistics [25], where the statistical error is proportional to \(1/\sqrt {n-m}\). Moreover, a missing value substitution algorithm (MVA) [26] running on the System, which is able to predict the most plausible values for the m missing values of x, results in higher accuracy when m is small relative to n. Hence, delaying delivery in order to avoid a bad x (with a high number of missing values) could be of high importance in terms of prediction accuracy and, more interestingly, avoids invoking the MVA each time a bad vector is available, thus eliminating a redundant waste of resources [9].
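To make the deviation in Case 1 concrete, the following minimal sketch computes the mean over the observed values only, together with the \(1/\sqrt {n-m}\) factor. It is an illustration rather than part of the proposed mechanism; representing a missing value by `None` and the helper name `mean_of_observed` are assumptions made here.

```python
import math
from typing import Optional, Sequence, Tuple


def mean_of_observed(x: Sequence[Optional[float]]) -> Tuple[float, int, float]:
    """Return (mean over observed values, n - m, 1 / sqrt(n - m)).

    Missing values are encoded as None (an assumption for illustration).
    """
    observed = [v for v in x if v is not None]
    if not observed:
        raise ValueError("context contains no observed values")
    k = len(observed)  # k = n - m observed values
    return sum(observed) / k, k, 1.0 / math.sqrt(k)


# Hypothetical context with n = 4 values; the second vector misses one value.
full = [20.0, 21.0, 19.0, 40.0]
partial = [20.0, 21.0, 19.0, None]

print(mean_of_observed(full))     # mean over all n values
print(mean_of_observed(partial))  # mean over the n - m observed values
```

In this toy vector the single missing value moves the estimated mean from 25.0 to 20.0, which is the kind of deviation a delayed delivery tries to avoid.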

Case 2 [Validity of Contextual Data]

Consider an analytic task, such as concept drift detection or novelty detection, that requires its input x to contain a high number of non-expired values. Here we deal with the fact that the validity of each value x i is characterized by an expiration window. That is, for each value x i there is an expiration indicator I i (x i ) = 1 if x i is a valid value; 0 otherwise (i.e., expired value). The mechanism has to ‘delay’ the delivery of x to the detection algorithm by attempting to find a better vector of n values at some unknown time in the future, which maximizes \(f(\mathbf {x}) = {\sum }_{i=1}^{n}I_{i}(x_{i})\), i.e., context that contains a high number of valid values.
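A minimal sketch of the validity count in Case 2 is given below. How the expiration indicator is evaluated is not specified in the text; here we assume, purely for illustration, that each value carries a timestamp and is valid while its age does not exceed a common expiration window. The function name `count_valid` is hypothetical.

```python
from typing import Sequence


def count_valid(timestamps: Sequence[float], now: float, window: float) -> int:
    """f(x) = sum_i I_i(x_i): number of values that have not expired.

    Assumption: I_i(x_i) = 1 when the age (now - timestamp) is within the window.
    """
    return sum(1 for ts in timestamps if now - ts <= window)


# Three values observed at times 0, 5 and 9; evaluated at time 10 with window 5.
print(count_valid([0.0, 5.0, 9.0], now=10.0, window=5.0))
```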

Case 3 [Contextualized Inference]

Contextual data fusion processing has gained significant importance [10]. Contextual data fusion refers to the problem of combining diverse and conflicting contextual information provided by sources in a consistent and coherent manner [11]. The objective of contextualized inference is to infer a sub-taxonomy of situations (from the very abstract to the very specific) of a system that is being observed, or a taxonomy of activities being performed [13, 14]. Specifically, contextualized inference methods [14] are generally applied in situation- and context-aware systems [16, 17], where a more specific situation (positioned at the lower levels of the situational taxonomy) is represented by the logical conjunction of situational components [12, 15]. Let us adopt the widely used IF-THEN situational knowledge representation inference rule, i.e., the logical conjunction \(f(\mathbf {x}) = \bigwedge _{i=1}^{n}(f_{i}(x_{i})) \in \{\text {\texttt {TRUE}, \texttt {FALSE}}\}\) of n logical operators f i (x i ) over aggregated (or not) values x i , e.g., the situational component f i (x i ) = TRUE if \(x_{i} \in [x^{low}_{i},x^{high}_{i}]\); FALSE, otherwise. That is, f(x) is envisaged as an IF-THEN situational rule for evaluating the current situational context given the current context x. A predictive analytics system caters for inferring the most specific situation within a situational taxonomy. That is, situation f(x) conveys more information to the system than situation f′(x) iff one can deduce f′(x) from f(x), i.e., f(x) contains more TRUE situational components than f′(x). Such a situation-aware system has to ‘delay’ its situational inference by observing as many true facts, i.e., components with {x i = TRUE}, as possible to reason about more specific situations, which further activates more specific actuation rules and decisions, compared with the ‘trivial’ abstract situations, i.e., those containing a high number of {x i = FALSE} components.
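The interval-based situational rule of Case 3 can be sketched as follows. The helper names and the example ranges are hypothetical; the code only illustrates the conjunction of components f i (x i ) and the count of TRUE components, which plays the role of the context quality here.

```python
from typing import List, Sequence, Tuple


def situational_components(
    x: Sequence[float], ranges: Sequence[Tuple[float, float]]
) -> List[bool]:
    """f_i(x_i) is TRUE when x_i lies in [low_i, high_i], FALSE otherwise."""
    return [low <= v <= high for v, (low, high) in zip(x, ranges)]


def situation_holds(x: Sequence[float], ranges: Sequence[Tuple[float, float]]) -> bool:
    """f(x): logical conjunction of all situational components."""
    return all(situational_components(x, ranges))


# Hypothetical ranges for two contextual parameters.
ranges = [(0.0, 10.0), (5.0, 15.0)]
print(situation_holds([3.0, 7.0], ranges))              # both components are TRUE
print(sum(situational_components([3.0, 20.0], ranges)))  # number of TRUE components
```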

Case 4 [Progressive and Maintenance Analytics]

The author would like to mention the prior work [18] and [19] on the optimal maintenance of the top-k list of objects over incomplete multivariate data streams and on intelligent progressive Big Data analytics. The work in [18] refers to an intelligent scheduling of top-k list maintenance with the purpose of increasing the quality of the list delivered to an analytics back-end system. Generally speaking, in this case f(x) abstracts the degree of updates of sequential partial results x from merged top-k lists. Hence, a predictive analytics system ‘delays’ its final top-k list maintenance based on the quality of the partial results seen so far. The work in [19] deals with continuous queries over a distributed federation of data nodes and returns the final outcome to users or analytics applications. Based on the current quality of the partial results retrieved so far (abstracted by a non-trivial f(x) over partial results x), the system engages a sub-set of query processors to further execute the issued queries. In both analytics systems, one has to define an optimally scheduled mechanism over queries to provide optimal decisions on when to invoke a maintenance process [18] or to further analyze data given analytics queries [19].

In all these real-life cases, the predictive analytics system requires more information or quality information in order to proceed with an analytics task, e.g., either situational inference, aggregation, or classification tasks. However, a delay in the delivery of vectors x to the System incurs some penalty, especially when dealing with real time predictive analytics as in the above mentioned cases. On the one hand, we require immediate consumption of the observed pieces of context x by the predictive analytic tasks. On the other hand, we require a high quality of the analytics / prediction / classification results, which fundamentally relies on the quality of the received pieces of context, i.e., the input to the System. We attempt to reduce the redundant invocations of predictive analytics tasks with inputs of low quality, which inevitably lead to ‘biased’ inference and statistical reasoning results. Evidently, there is a trade-off between delaying the consumption of the observed context (thus feeding the System with high quality of context) and the near real time processing associated with a delay-tolerant predictive analytics process. The problem here is to determine when to deliver high quality context balancing between quality of analytics results and near real time predictive analytics.

2.2 Contribution & organization

The contribution of this paper is an analytical stochastic optimization mechanism, which monitors streams of pieces of context and optimally determines when to deliver context of high quality to the System for predictive analytics. This mechanism is based on the principles of the theory of optimal stopping [27], through which we derive an optimal decision time to ‘stop’ observing the contextual data stream and to ‘deliver’ context such that the expected predictive analytics quality is maximized given a certain cost per observation. The theory of optimal stopping [27] has proved to be very efficient in cases where we try to find the appropriate decision time instance to stop the observation of a stochastic process with the objective of maximizing our payoff or reward. Naturally, we build our mechanism on the principles of the optimal stopping theory to maximize the quality of predictive analytics results by inducing a controlled delay. Through this delay we attempt to balance between immediate and delayed predictive analytics in the hope of observing higher quality pieces of contextual information, as illustrated in Cases 1–3. The outcome of the mechanism indicates whether we should stop observing the quality of the context streams and activate a predictive analytics and/or MSL method, or continue. 
This delay-tolerant activation supports intelligent analytics applications that can tolerate some delay in hopes of obtaining high quality results, like: (i) progressive query analytics applications in large-scale distributed systems [19], (ii) results maintenance of rank-based queries over data streams [18], (iii) efficient networking analytics applications for location-based services [34], (iv) efficient and progressive recommendations of recommendation systems and applications [35], (v) efficient user’s mobility and trajectory patterns extraction in mobile computing environments [36], (vi) quality information forwarding and dissemination in mobile applications over IoT environments [37, 39, 40], and (vii) security analytics for location-privacy [40].

As it will be shown in the performance assessment, our mechanism provides a wide range of quality results, ranging between medium quality results with almost zero delay and high quality results with an acceptable delay. Through this delay (in terms of the application tolerance), the System saves computational resources and eliminates redundant activations of MSL methods/analytics tasks.

The contribution of this work is summarized as follows:

  • A novel stochastic optimization mechanism which decides when a predictive analytics task should be activated over large-scale contextual data streams by guaranteeing the highest possible quality results.

  • An analytical model under the principles of the optimal stopping theory that derives the optimal time for activating the predictive analytics tasks.

  • Comprehensive experimental results showcasing the benefits of our mechanism to real life intelligent predictive analytics applications over real contextual data involving widely applied aggregation analytics vis-à-vis the threshold-based and immediate context delivery approaches.

The paper is organized as follows: Section 3 introduces the concept of context quality for data streams of (possibly problematic) contextual data and some preliminaries in the theory of optimal stopping. Section 4 formulates and provides a solution to the quality-optimized mechanism for the considered stochastic optimization problem. Section 5 reports on the experimental results of our mechanism through a sensitivity analysis of the basic parameters and provides a comparative assessment with threshold-based and immediate context delivery rules over real sensors contextual data. Finally, Section 6 concludes the paper and discusses future research on that topic.

3 Definitions

Table 1 refers to the nomenclature.

Table 1 Nomenclature

3.1 Quality of contextual information

Consider a discrete time domain \(\mathbb {T} = \{1, 2, {\ldots } \}\) such that x = [x 1,…,x n ] contains real values \(x_{i} \in \mathbb {R}\) at time \(t \in \mathbb {T}\) for each dimension i ∈ {1,…,n} (or, in compact notation, i ∈ [n]). We assume that x i at time t refers to the measurement of source i or the aggregation result over K measurements x i1,…,x iK launched on source i, K > 0. (The value x ij could refer to a measurement of the j-th neighboring node in the spatial neighborhood of source i, j ∈ [K].) We further assume that each measurement x i is received instantly and that a new value might be received from the same source i only at the next time slot t + 1, i.e., in the interval [t,t + 1) source i reports only once or not at all.

We proceed with a generic model representation to capture the idea of a good piece of context x. Specifically, the characterization of x as a ‘good’ piece of context intuitively indicates that x contains a relatively high number of good values, e.g., a percentage of 75 % of the n values of context x refers to non-missing values. A ‘good’ value x i at time t means, for instance, that x i is a valid value, a non-incomplete value, or a TRUE fact/situation, i.e., I i (x i )=1 as discussed in Cases 1 and 2 or I i (x i )=TRUE in Case 3, while I i (x i )=0 indicates a bad value, or a missing datum (Cases 1,2) or a situation does not hold true (I i (x i )=FALSE in Case 3). Or, if x i is observed at time t thus not being missed as discussed in Case 1, then x i is called a good value, otherwise it is called a bad value, i.e., a missing value. Based on all these interpretations, we provide the following definitions:

Definition 1

The quality indicator of the i-th measurement (i.e., from the i-th source) is defined as the random variable (r.v.) \({X_{t}^{i}}\) such that:

$$\begin{array}{@{}rcl@{}} {X_{t}^{i}} = \left\{\begin{array}{ll} 1 (\text{\texttt{TRUE}}) & \text{ with probability } \beta_{i}\\ 0 (\text{\texttt{FALSE}}) & \text{ with probability } 1-\beta_{i}, \end{array}\right. \end{array} $$
(1)

where a zero value, i.e., \({X_{t}^{i}} = 0\), indicates a bad value of dimension i at time t while a value \({X_{t}^{i}} = 1\) refers to a good value x i at t.

The r.v.s \({X_{1}^{i}}, {X_{2}^{i}}, \ldots \) are independent and identically distributed (i.i.d.) with expectation E[X i] = 1⋅P(X i = 1) + 0⋅P(X i = 0) = β i > 0 given that β i ∈ (0,1), i ∈ [n]. The value of β i can be estimated from historical data and/or combined with information provided by the manufacturer of source i, e.g., quantifying the sensor node’s degree of measurement reliability. (Remark 2 provides an estimation of the β parameter.) Each time t the mechanism observes context x and does not immediately deliver it to the System, we incur a fixed (delay) cost of observation c > 0.

Definition 2

We define as quality reward of context x at time t the r.v. Y t , which refers to the quantity of the good values \(M_{t} = {\sum }_{i=1}^{n}{X_{t}^{i}}\) minus the total observation cost up to time t, i.e.,

$$\begin{array}{@{}rcl@{}} Y_{t} & = & \sum\limits_{i=1}^{n}{X_{t}^{i}} - t \cdot c = M_{t} - t \cdot c. \end{array} $$
(2)
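Definitions 1 and 2 can be illustrated with a short sketch that draws the quality indicators and evaluates the reward in (2). The function names are hypothetical, and the Bernoulli draws use Python's standard pseudo-random generator, which is an implementation choice made here.

```python
import random
from typing import List, Sequence


def sample_quality_indicators(betas: Sequence[float], rng: random.Random) -> List[int]:
    """Draw X_t^i = 1 with probability beta_i and 0 otherwise (Definition 1)."""
    return [1 if rng.random() < b else 0 for b in betas]


def quality_reward(indicators: Sequence[int], t: int, c: float) -> float:
    """Y_t = M_t - t * c, where M_t is the number of good values (Definition 2)."""
    m_t = sum(indicators)
    return m_t - t * c


rng = random.Random(42)                       # fixed seed for repeatability
x_t = sample_quality_indicators([0.8] * 5, rng)
print(x_t, quality_reward(x_t, t=2, c=0.5))
```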

3.2 Preliminaries on the optimal stopping theory

The theory of optimal stopping [27, 28] is concerned with the problem of choosing a time instance to take a certain action, in order to minimize an expected loss (or maximize an expected payoff). A stopping rule problem is associated with:

  • a sequence of random variables (r.v.) M 1, M 2,…, whose joint distribution is assumed to be known and

  • a sequence of payoff (reward) functions \((Y_{t}(M_{1},\ldots ,M_{t}))_{t \geq 1}\), which depend only on the observed values of the corresponding r.v.s M 1,…,M t .

The available information up to t is a sequence \(\mathcal {F}_{t}\) of values of the r.v.s M 1,…,M t (a.k.a. filtration). The optimal stopping rule problem is defined as follows: We are observing the sequence of the r.v.s \((M_{t})_{t \geq 1}\), and at each time instance t, we can choose to either stop observing or continue. If we stop observing at time instance t, we get reward Y t . We desire to choose a stopping rule or stopping time to maximize our expected reward.

Definition 3

An optimal stopping rule problem is to find the stopping time T which maximizes the expected reward, i.e., \(E[Y_{T}] = \sup _{0 \leq t \leq \mathcal {T}} E[Y_{t}]\). Note, \(\mathcal {T}\) might be \(\infty \).

4 Time-optimized quality-driven mechanism

The mechanism observes the sequence of r.v. M 1, M 2,…,M t without delivering the corresponding pieces of context x 1, x 2,…,x t to the System. Our aim is to find the best strategy in the sense of having the highest expected quality reward E[Y] at the lowest cumulative cost of delay. At each time t we only need to decide:

  • whether to deliver x t to the System, thus, proceeding with a predictive analytic task over f(x t ) or

  • to continue with the next observation x t + 1 without delivering x t to System, thus, delaying the predictive analytic task.

Hence, a strategy is a function which assigns to each sequence M 1, M 2,… a stopping time. Furthermore, since we cannot see the future, a decision to stop observation at time t can only depend upon M 1,…,M t . Formally we have to solve the following problem:

Problem 1

Given the sequence of sums of quality indicators M 1,…,M t , find the optimal stopping time T which maximizes \(E[Y_{T}] = \sup _{0 \leq t < \infty } E[Y_{t}]\).

The idea is to find a criterion at time instance t such that given the current value of M t , denoting the current quality of context observed at the mechanism, the latter immediately decides whether to deliver x t to the System or to continue to the next observation. We require an immediate decision making over the contextual data streams, thus, avoiding any redundant computations. As it will be shown in the remainder, the mechanism at time instance t proceeds with a time-optimized decision in O(n) time involving simply the counting of quality indicators \({X_{t}^{i}}\) from all n sources, i ∈ [n].

In order to solve Problem 1, we rest on the principle of optimality. Specifically, let T be the optimal stopping time where the supremum in our Problem 1 is attained, i.e., \(E[Y_{T}] = V^{*}\) with \(V^{*} = \sup _{t} E[Y_{t}]\). We can now provide the optimality equation given the filtration \(\mathcal {F}_{t}\), i.e., after observing M 1,…,M t , as follows:

Theorem 1

Let T be an arbitrary stopping time and \(V^{*}_{t} = \sup _{T\geq t}E[Y_{T}|\mathcal {F}_{t}]\). Then, \(V^{*}_{t} = \max (Y_{t}, E[V^{*}_{t+1}|\mathcal {F}_{t}])\).

Proof

See [28] \(\square \)

The optimal stopping time T given by the principle of optimality from Theorem 1 is represented by the rule:

$$ T = \min\{t \geq 0 | Y_{t} = V^{*}_{t}\}. $$
(3)

Let us put the reward \(Y_{0} = -\infty \) to force our mechanism to take at least one observation. Also, we put \(Y_{\infty } = -\infty \), as naturally the cost of an infinite number of observations is infinite. Consider now \(V^{*}\), the expected quality reward for the System based on the optimal stopping rule in (3). Suppose that the mechanism incurs cost c and observes M 1. Note that if the mechanism continues from this point, then quality M 1 is ‘lost’ and the cost c is already paid. Hence, it is just like starting the problem over again. That is, if the mechanism continues from this point, the System can obtain an expected quality reward of \(V^{*}\) but no more. Therefore, from the principle of optimality in Theorem 1 we derive that if \(M_{1} < V^{*}\) then the mechanism should continue; if \(M_{1} > V^{*}\), then the mechanism should stop and deliver context to the System. For \(M_{1} = V^{*}\) both decisions are optimal; we adopt here a stopping decision. This argument is made at any stage t by the mechanism; thus, in our case we provide the optimal stopping rule, which is adopted by the mechanism, as follows:

Theorem 2

Given the sequence M 1 ,…,M t there is a real number \(y = V^{*}\) such that the optimal stopping time T is given by T = min{t ≥ 1 | M t > y} with E[Y T ] = y.

Proof

The r.v. \(M = {\sum }_{i=1}^{n}X^{i}\) takes discrete realization values from {0,1,…,n}. Now, at the optimal stopping time t = T, i.e., the first time at which M t > y, we obtain \(E[Y_{T}] = E[M_{T}]- E[T]c = {\sum }_{i=1}^{n}E[{X_{T}^{i}}] - E[T]c\). Moreover, let γ = P(M > y) and δ = 1 − γ = P(M ≤ y). Then, we obtain

$$\begin{array}{@{}rcl@{}} E[M_{T}] & = &\sum\limits_{k=1}^{\infty}E[M_{k}| M_{k} > y, M_{1} \leq y, {\ldots}, M_{k-1} \leq y] P(T = k) \\ & = & \sum\limits_{k=1}^{\infty}E[M_{k} | M_{k} > y] \gamma \delta^{k-1} = E[M_{1}| M_{1} > y]. \end{array} $$

Here we used that, since the M t are i.i.d., the stopping time T is geometrically distributed with \(P(T = k) = \gamma \delta ^{k-1}\). The quantity \(E[M_{T}] = E[M_{1}|M_{1} > y]\) indicates that at the optimal stopping time T, the expected context quality equals the expected context quality given that the latter is above the criterion threshold \(y = V^{*}\). In addition, for the optimal stopping time T we obtain

$$\begin{array}{@{}rcl@{}} E[T] & = & \sum\limits_{k=1}^{\infty} k P(M_{k} > y, M_{1} \leq y, {\ldots}, M_{k-1} \leq y) \\ & = & \gamma \sum\limits_{k=1}^{\infty} k \delta^{k-1} = \frac{1}{\gamma}. \end{array} $$

The problem now is to compute \(y = V^{*}\). This is done through the optimality equation in Theorem 1 and the above mentioned argument, i.e.,

$$\begin{array}{@{}rcl@{}} V^{*} & = & E[\max(M_{1},V^{*})] - c \Leftrightarrow \\ c & = & E[(M_{1}-V^{*})^{+}] \end{array} $$

That is, a quality reward E[Y T ] is obtained at the optimal stopping time T with quality greater than y, and y is the solution of \(E[(M_{1}-y)^{+}] = c\), with \((x-y)^{+} = \max (0,x-y)\).

Hence, having a y such that \(c =E[\max (0,M_{1}-y)] = E[({\sum }_{i=1}^{n}{X_{1}^{i}} - y)^{+}] = \gamma E[M_{1} - y| M_{1} > y]\), we obtain

$$\begin{array}{@{}rcl@{}} E[Y_{T}] & = & E[M_{T}] - \frac{1}{\gamma}c \\ & = & E[M_{1}| M_{1} > y] - E[M_{1} - y| M_{1} > y] \\ & = & y. \end{array} $$

Hence, the optimal stopping time T achieves the maximal expected quality reward E[Y T ] = y. □

Remark 1

The optimal rule in Theorem 2 is optimal for our problem since \(E[(M-y)^{+}] - c\) is monotonically non-decreasing with M for M > y almost surely and \(E[(M-y)^{+}]\) is continuous in y and decreasing from \(+\infty \) to zero. Hence there is a unique solution for y for any c > 0.

The mechanism stops the observation process of pieces of context and delivers context x t at the first time instance t at which the quantity of the good values M t is above a threshold \(y \in \mathbb {R}\), which refers to the highest quality reward that can be obtained. The problem now reduces to the evaluation of the y value such that \(E[(M_{1}-y)^{+}] = c\). The algorithm of our mechanism is shown in Fig. 2. The input of the algorithm is the stopping criterion y. At each received context x t , the mechanism calculates M t and decides whether to deliver x t to the System or not. In the former case, the mechanism starts off with the next sequence of (M t ). Evidently, the computational time for evaluating the criterion M t > y is O(n).

Fig. 2 Algorithm of the quality-optimized mechanism
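The delivery loop described above can be sketched as follows. This is a minimal illustration of the stopping criterion M t > y, not a reproduction of the algorithm in Fig. 2; the generator name `deliver_when_good` and the representation of each piece of context by its vector of 0/1 quality indicators are assumptions.

```python
from typing import Iterable, Iterator, Sequence, Tuple


def deliver_when_good(
    indicator_stream: Iterable[Sequence[int]], y: float
) -> Iterator[Tuple[int, int]]:
    """Yield (t, M_t) each time the stopping criterion M_t > y is met.

    t counts the observations within the current sequence and restarts
    after every delivery.
    """
    t = 0
    for indicators in indicator_stream:
        t += 1
        m_t = sum(indicators)  # O(n) count of good values
        if m_t > y:
            yield t, m_t       # deliver x_t to the System
            t = 0              # start off with the next sequence (M_t)


# Hypothetical stream with n = 3 sources and threshold y = 2.5.
stream = [[1, 0, 0], [1, 1, 1], [1, 1, 0], [1, 1, 1]]
print(list(deliver_when_good(stream, y=2.5)))
```

A generator keeps the decision per observation at O(n) and leaves the handling of a delivered vector to the caller.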

We proceed our analysis with the case where β i = β for all sources, i ∈ [n]. Let Z = max(M − y, 0) and let F M (y) = P(M ≤ y) be the cumulative distribution function of M; then y is the solution of E[Z] = c. We have that E[Z] = E[M − y | M > y](1 − P(M ≤ y)) = (E[M | M > y] − y)(1 − F M (y)). In this case, \(M = {\sum }_{i=1}^{n}X^{i}\) is a Binomial random variable with parameters (n,β). Hence, we obtain \(F_{M}(y) = {\sum }_{j=0}^{\lfloor y \rfloor } \dbinom {n}{j} \beta ^{j}(1-\beta )^{n-j}\). Moreover, we have that \(E[M | M>y] = {\sum }_{m=0}^{n}m P(M=m | M > y)\) or

$$E[M | M > y] = \frac{1}{1-F_{M}(y)} \sum\limits_{m=\lfloor y \rfloor +1}^{n} m \dbinom{n}{m} \beta^{m} (1-\beta)^{n-m}. $$

Hence, the expectation of Z is:

$$ E[Z] = \sum\limits_{m=\lfloor y \rfloor +1}^{n} m \dbinom{n}{m} \beta^{m}(1-\beta)^{n-m} - y(1-F_{M}(y)) $$
(4)

Based on the criterion E[Z] = c and on (4), we can find analytically the value of y. The assumption β i = β, ∀i, does not affect the theoretical results and is adopted to simplify the computation of F M (y) when solving \(E[(M_{1}-y)^{+}] = c\). When β i ≠ β j for some i,j ∈ [n], F M (y) is provided in [29] (a.k.a. the Poisson-Binomial distribution); thus, we can obtain the corresponding value for y.
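As an illustration of how y could be obtained in practice for the case β i = β, the sketch below evaluates E[max(M − y, 0)] directly from the Binomial probability mass function and solves E[Z] = c by bisection. The bisection, the tolerance and the function names are assumptions made for this example; the text itself does not prescribe a numerical method.

```python
from math import comb
from typing import List


def binomial_pmf(n: int, beta: float) -> List[float]:
    """P(M = m) for m = 0..n, with M ~ Binomial(n, beta)."""
    return [comb(n, m) * beta**m * (1 - beta) ** (n - m) for m in range(n + 1)]


def expected_excess(y: float, pmf: List[float]) -> float:
    """E[Z] = E[max(M - y, 0)] = sum_m max(m - y, 0) * P(M = m)."""
    return sum(max(m - y, 0.0) * p for m, p in enumerate(pmf))


def solve_threshold(n: int, beta: float, c: float, tol: float = 1e-9) -> float:
    """Find y with E[max(M - y, 0)] = c by bisection (assumes 0 < beta < 1, c > 0)."""
    pmf = binomial_pmf(n, beta)
    lo, hi = -c, float(n)  # E[Z] = n*beta + c > c at lo, and E[Z] = 0 < c at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_excess(mid, pmf) > c:
            lo = mid       # excess still too large: the threshold must increase
        else:
            hi = mid
    return (lo + hi) / 2


y = solve_threshold(n=30, beta=0.8, c=0.5)
print(y, expected_excess(y, binomial_pmf(30, 0.8)))
```

Because E[Z] is continuous and decreasing in y, bisection converges to the unique root; a larger cost c yields a lower threshold and hence an earlier delivery.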

Remark 2

The probability β of a non-problematic piece of contextual value X i can be incrementally estimated by the maximum likelihood estimation of β of the Binomial distribution with parameters (n,β) after observing a series of m pieces of context \((\mathbf {x}_{t})_{t=1}^{m}\), m > 1. Specifically, recall that the probability mass function of the Binomial is \(\dbinom {n}{M}\beta ^{M}(1-\beta )^{n-M}\) with M = 0,…,n. Hence, the log-likelihood \(\mathcal {L}_{m}(\beta )\) of a series of m samples M 1,…,M m is

$$\begin{array}{@{}rcl@{}} \mathcal{L}_{m}(\beta) &=& \sum\limits_{i=1}^{m}\ln \dbinom{n}{M_{i}} + \ln \beta \sum\limits_{i=1}^{m}M_{i}\\ &&+ \left( nm - \sum\limits_{i=1}^{m}M_{i} \right) \ln (1-\beta). \end{array} $$

Since \(\mathcal {L}_{m}(\beta )\) is a continuous function of β given m observations, i.e., β = β m , its maximum value derives from the derivative of \(\mathcal {L}_{m}(\beta )\) with respect to β m by setting it equal to zero, i.e., \(\frac {\partial \mathcal {L}}{\partial \beta _{m}}=0\). After this calculation, we obtain that up to the m-th observation, the probability β m is: \(\beta _{m} = \frac {1}{nm}{\sum }_{i=1}^{m}M_{i}\). Hence, we can incrementally estimate the β m value by the previous β m−1 and the current value of M m by using the recursion \(\beta _{m} = \frac {m-1}{m} \beta _{m-1} + \frac {1}{nm}M_{m}\), with \(\beta _{1} = \frac {1}{n}M_{1}\). After a series of m observations, we can learn the β = β m and then initiate our mechanism.
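The recursion for β m can be sketched as a small running estimator. The class name `BetaEstimator` is hypothetical; the update rule is the recursion stated above.

```python
class BetaEstimator:
    """Incremental maximum likelihood estimate of beta for Binomial(n, beta)."""

    def __init__(self, n: int) -> None:
        self.n = n        # number of sources
        self.m = 0        # number of pieces of context observed so far
        self.beta = 0.0   # current estimate beta_m

    def update(self, good_values: int) -> float:
        """Apply beta_m = ((m - 1) / m) * beta_{m-1} + M_m / (n * m)."""
        self.m += 1
        self.beta = (self.m - 1) / self.m * self.beta + good_values / (self.n * self.m)
        return self.beta


estimator = BetaEstimator(n=10)
for m_t in [8, 6, 10]:          # hypothetical counts of good values M_1, M_2, M_3
    estimator.update(m_t)
print(estimator.beta)           # equals (8 + 6 + 10) / (10 * 3) up to rounding
```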

5 Experimental evaluation

5.1 Sensitivity analysis

5.1.1 Simulation setup

We study the performance of the proposed Optimal Delivery Approach (ODA) on both the analytical model and simulations with respect to the basic parameters, i.e., probability of a good value β, number of sources n, and cost per observation c. We also provide a comparative assessment with a Threshold-based Delivery Approach (TDA) on deciding when to deliver context to the System for further processing. Specifically, TDA chooses a threshold 𝜃 ∈ {1,…,n} and delivers context x at the first time t at which M t ≥ 𝜃. That is, when context x has at least 𝜃 (out of n) non-problematic values, then TDA immediately delivers x to the System.

We define as ‘epoch’ the number of pieces of context an approach (ODA, TDA) has observed until it decides to deliver the current context to the System. Each time context \(\mathbf {x}_{t}\) is delivered to the System, a new epoch for the approach starts. At the beginning of each epoch, TDA chooses a threshold 𝜃 uniformly at random from {1,…,n}, while ODA applies in every epoch the threshold y as estimated using (4). In the j-th epoch we measure the quality reward \(Y_{t_{j}}\) when an approach (ODA, TDA) delivers context \(\mathbf {x}_{t_{j}}\) at stopping time \(t_{j}\). We run experiments for \(N = 10^{4}\) epochs, thus obtaining the average value of Y, i.e., \(E[Y] \sim \frac {1}{N}{\sum }_{j=1}^{N}Y_{t_{j}}\), for both approaches.
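The epoch procedure for the TDA baseline can be sketched as follows. This is an illustrative simulation under stated assumptions, not the code used for the reported results: the threshold set is read as {1,…,n}, each \(M_{t}\) is drawn independently from B(n,β), and the function names and the `max_steps` safety cap are hypothetical. Since the quality reward Y is defined outside this section, the sketch records only the stopping time and the delivered \(M_{t}\).

```python
import random


def draw_M(n, beta, rng):
    """One draw from B(n, beta): number of non-problematic values among n."""
    return sum(1 for _ in range(n) if rng.random() < beta)


def run_tda_epoch(n, beta, rng, max_steps=100_000):
    """One TDA epoch: pick theta uniformly from {1, ..., n}, stop at first M_t >= theta.

    Returns (t, M_t, theta). max_steps is a safety cap added for this sketch;
    if it is reached, the last observation is delivered regardless of theta.
    """
    theta = rng.randint(1, n)          # inclusive on both ends
    for t in range(1, max_steps + 1):
        M_t = draw_M(n, beta, rng)
        if M_t >= theta:
            return t, M_t, theta
    return max_steps, M_t, theta


def average_over_epochs(n, beta, epochs, seed=0):
    """Average stopping time and average delivered M_t over a number of epochs."""
    rng = random.Random(seed)
    results = [run_tda_epoch(n, beta, rng) for _ in range(epochs)]
    mean_t = sum(r[0] for r in results) / epochs
    mean_M = sum(r[1] for r in results) / epochs
    return mean_t, mean_M


mean_t, mean_M = average_over_epochs(n=10, beta=0.8, epochs=1000)
print("average stopping time:", mean_t)
print("average M_t / n:", mean_M / 10)
```

A small n is used in the example because a threshold close to n with a low β makes the stopping time very long, which is also why the cap is included.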

5.1.2 Performance assessment

Figure 3(left) shows the impact of probability β on the average quality reward E[Y] with different cost values c for the analytical model and the simulation results using n = 30; we obtain similar results for other n values. The simulation curves closely fit the analytical model curves for all parameter values, which indicates that the proposed model can predict the average quality reward given β and c. Moreover, as β increases we obtain higher quality rewards, as expected, since we deal with less problematic pieces of data. By a less problematic piece of data we mean, here, that the context vector x contains more non-missing values than missing values. Statistically, for β > 0.5, context x is less problematic than a piece of context obtained with β < 0.5, since the former contains, in expectation, more non-missing values than the latter. That is, in context x, over 50% of the n values are expected to be non-missing, given that each value is non-missing with probability over 0.5 and the expectation of the Binomial distribution ∼B(n,β) is nβ. Also, the impact of the delay cost on E[Y] is low compared to the impact of β, especially when c > 0.5.

Fig. 3 (Left) Quality reward E[Y] against probability β for analytical model and simulations with different cost c and n = 30; (right) average delivery delay of the proposed approach, i.e., E[T], against cost c for different n values with β = 0.8

In Fig. 3(right) we plot the average delay E[T] against the cost c for different values of n with β = 0.8. E[T] indicates the average number of observations that the mechanism skips in each epoch before stopping and delivering context to the System for predictive analytics. As shown in Fig. 3(right), a relatively small delay is tolerated in order to deliver context of high quality. This indicates the applicability of the proposed ODA to near real-time predictive analytics. Moreover, as the cost per observation decreases, a relatively higher delay is encountered, since a low cost c gives the mechanism the ‘opportunity’ to observe more pieces of context before stopping at a good one, thus increasing the likelihood of receiving context of high quality. On the other hand, a high cost value forces the mechanism to stop (and thus deliver context) at an early stage of each epoch. For instance, for c = 0.8 the mechanism, on average, delivers the second received context to the System. By tuning the cost we can control the degree of tolerance of the statistical analytics process, with c → 1 indicating a very conservative system and c → 0 indicating a highly delay-tolerant one.

Let us define the Normalized Quality Indicator (NQI) \(\frac {1}{n}M_{t}\) of an approach, which evaluates the quality of the delivered context \(\mathbf {x}_{t}\) when stopping at time t within an epoch. Recall that \(M_{t}\) indicates the number of non-problematic values that context \(\mathbf {x}_{t}\) contains, with \(0 \leq M_{t} \leq n\). Hence an NQI value close to unity denotes delivered context of high quality. Figure 4 illustrates the average NQI for the ODA (over all epochs, i.e., \(\frac {1}{n}E[M] \sim \frac {1}{nN}{\sum }_{j=1}^{N}M_{t_{j}}\)) against the number of sources n for different cost values c and β ∈ {0.1,0.8}. It is worth noting that the NQI of an approach that stops the observation process at an arbitrary time and then delivers context at that time to the System is \(\frac {1}{n}E[M] = \frac {1}{n}\beta n = \beta \), where E[M] = β n is the expectation of the Binomial distribution ∼B(n,β); we denote this value by BNQI. This approach does not take into consideration the sequence of the r.v. \(M_{1},\ldots ,M_{t-1}\) when deciding whether to stop at time t. On the other hand, ODA takes into account the sequence \((M_{t})_{t=1}^{T}\), thus exploiting the knowledge up to T and always obtaining higher values than BNQI, even for high cost values, as shown in Fig. 4. In addition, NQI for relatively medium/high cost values does not depend on the number of sources n, which means that E[M] increases linearly with n. Note also that the higher the β value, the higher the NQI, since the received context is of high quality, while as β → 0 the NQI takes lower values. However, even in that case, NQI is always higher than BNQI, indicating the applicability of ODA in cases where the received context contains a high portion of problematic values. Indicatively, for β = 0.01 we obtain NQI = 1.16 and BNQI = 0.01, i.e., our approach delivers two orders of magnitude more quality context with n = 30,c = 0.1. 
Nonetheless, we have to evaluate the performance of the ODA including also the incurred delay, i.e., E[T], required to proceed with context delivery of high quality. We compare the expected quality reward E[Y] for both approaches (ODA / TDA) for certain values of c, β and n.
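To see why a selective stopping rule yields an NQI above BNQI, and why this comes with a delay, the following sketch computes both quantities exactly for a generic fixed-threshold rule. It assumes the \(M_{t}\) are independent draws from B(n,β), so the delivered count is distributed as M conditioned on M ≥ 𝜃 and the stopping time is geometric. The fixed threshold is a stand-in chosen for illustration; it is not the ODA threshold y from (4), and the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, beta):
    """P(M = k) for M ~ B(n, beta)."""
    return comb(n, k) * beta**k * (1 - beta) ** (n - k)


def threshold_rule_stats(n, beta, theta):
    """Exact NQI and expected stopping time for 'stop at the first M_t >= theta'.

    NQI is E[M | M >= theta] / n; the expected stopping time is 1 / P(M >= theta).
    Assumes 0 < beta < 1 so that the tail probability is positive.
    """
    tail = sum(binom_pmf(k, n, beta) for k in range(theta, n + 1))
    mean_M = sum(k * binom_pmf(k, n, beta) for k in range(theta, n + 1)) / tail
    return mean_M / n, 1.0 / tail


n, beta = 30, 0.8
for theta in (20, 24, 27):
    nqi, expected_t = threshold_rule_stats(n, beta, theta)
    print(f"theta={theta}: NQI={nqi:.3f}, E[T]={expected_t:.2f}, BNQI={beta}")
```

Raising the threshold increases the NQI but also the expected number of observations per epoch, which is the trade-off that the expected quality reward E[Y] is meant to capture.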

Fig. 4 The NQI and BNQI against number of sources n for different cost c with (left) β = 0.8 and (right) β = 0.1

Tables 2 and 3 show the average reward E[Y] for both approaches against cost per observation c and probability β with n = 30 and n = 50, respectively. E[Y] quantifies the quality of context delivered when an approach stops at a stopping time t, also accounting for the cumulative cost of observing t pieces of context. ODA always achieves a higher E[Y] value than TDA for all parameters. More interestingly, compared with TDA, ODA is deemed appropriate for delay-tolerant predictive analytics when context contains a high portion of problematic values, i.e., low β values. We can observe that for β = 0.1 and, especially, when the cost of observation is relatively high, i.e., c = 0.8, ODA delivers context of 112 % and 129 % more quality than TDA in terms of quality reward for n = 30 and n = 50, respectively. Moreover, as β increases, both ODA and TDA attain relatively high E[Y]. This is because high β values correspond to received context of high quality, so both approaches evidently deliver high quality context. However, even in this case, ODA outperforms TDA. When the cost of observation is relatively high and the received context contains a low portion of problematic values, ODA is 84 % and 48 % more efficient than TDA in terms of quality reward for n = 30 and n = 50, respectively; see Tables 2 and 3.

Table 2 Average quality reward E[Y] for ODA and TDA with n = 30
Table 3 Average quality reward E[Y] for ODA and TDA with n = 50

Overall, ODA delivers high quality context to the System, thus, improving the quality of predictive analytics, even when context contains, with a high probability, problematic values and the cost per observation is not negligible. This is attributed to the fact that ODA exploits the history of the observed sequence of M t and then decides on the optimal stopping time to deliver context at the expense of a controlled (relatively low) delay.

5.2 Comparative assessment

5.2.1 Experiment setup

We experiment with real contextual data from K = 16 chemical sensors exposed to three gases of three chemical compounds at a certain concentration level [32, 33]. Each sensor detects three specific environmental contextual parameters corresponding to Ethylene, Ammonia, and Toluene, respectively. Each sensor k ∈ [K] measures a triplet \(\mathbf {s}_{k} = [x_{k1}, x_{k2}, x_{k3}]\), where the dimensions of \(\mathbf {s}_{k}\) correspond to the three contextual parameters. The context at time instance t is then an n-dimensional vector with n = 3K = 48 dimensions, i.e., \(\mathbf {x}_{t} = (\mathbf {s}_{1}, \mathbf {s}_{2},\ldots ,\mathbf {s}_{K})\), and the dataset contains 13,910 48-dimensional contextual vectors. We focus on the case where each dimension of the context vector at time instance t may be missing. For experimentation, we set the probability of a missing (problematic) value in a dimension to p = 1 − β ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, i.e., the probability that a parameter is non-problematic at time instance t is β = 1 − p.
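The injection of missing values can be sketched as below. This is only an assumed reading of the setup: each dimension is dropped independently with probability p, missing entries are marked with `None`, and the readings are placeholders rather than values from the dataset. The function names are hypothetical.

```python
import random

K = 16          # number of sensors, as stated above
n = 3 * K       # 48 dimensions per context vector


def mask_context(x, p, rng):
    """Replace each entry by None independently with probability p = 1 - beta."""
    return [None if rng.random() < p else value for value in x]


def count_non_problematic(x_masked):
    """M_t: number of entries that are still present after masking."""
    return sum(value is not None for value in x_masked)


rng = random.Random(42)
x_t = [float(i) for i in range(n)]     # placeholder readings, not dataset values
x_masked = mask_context(x_t, p=0.3, rng=rng)
print(count_non_problematic(x_masked), "of", n, "values kept")
```

The count returned by `count_non_problematic` plays the role of \(M_{t}\), the quantity each delivery approach inspects before deciding whether to deliver.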

We consider two scenarios. In the first scenario (Scenario 1), the System processes the delivered context vector \(\mathbf {x}_{t}\), which might include missing values, by applying a fusion operator over the contextual values of the vector (described later). In the second scenario (Scenario 2), before processing the context vector \(\mathbf {x}_{t}\), the System invokes a Missing Value substitution Algorithm (MVA) for handling the missing values in \(\mathbf {x}_{t}\). After the invocation of the MVA, the System applies a fusion operator over the ‘imputed’ contextual values. For demonstration, we define two vectorial fusion operators: \(f_{avg}(\mathbf {x})\) is associated with the mean value of each chemical compound over all K sensors, and \(f_{\min }(\mathbf {x})\) is associated with the minimum value of each chemical compound over all K sensors, as follows:

$$\begin{array}{@{}rcl@{}} f_{avg}(\mathbf{x}) & = & \bigg[ \frac{1}{K}\sum\limits_{k=1}^{K}x_{kj} \bigg], j = 1,2,3 \end{array} $$
(5)
$$\begin{array}{@{}rcl@{}} f_{\min}(\mathbf{x}) & = & \bigg[\min\limits_{k \in [K]}\{x_{kj}\} \bigg], j=1,2,3 \end{array} $$
(6)
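A minimal sketch of the two operators in (5) and (6) follows. It assumes the context is stored as a flat list ordered as \((\mathbf {s}_{1},\ldots ,\mathbf {s}_{K})\) and that missing entries are marked with `None` and skipped, so the mean is taken over the available sensors only (for a complete vector this reduces to the factor 1/K in (5)). Returning `None` when a compound has no available value is a choice made for this sketch.

```python
def fuse(x, K, reducer):
    """Apply reducer per compound j = 1, 2, 3 over the available sensor values.

    x is the flat vector (s_1, ..., s_K) with s_k = [x_k1, x_k2, x_k3];
    None marks a missing entry and is skipped.
    """
    fused = []
    for j in range(3):
        values = [x[3 * k + j] for k in range(K) if x[3 * k + j] is not None]
        fused.append(reducer(values) if values else None)
    return fused


def f_avg(x, K):
    """Mean value of each compound over the sensors that reported it."""
    return fuse(x, K, lambda v: sum(v) / len(v))


def f_min(x, K):
    """Minimum value of each compound over the sensors that reported it."""
    return fuse(x, K, min)


# Toy vector with K = 3 sensors; the values are made up for illustration.
x = [1.0, 10.0, 100.0,     # sensor 1
     3.0, None, 300.0,     # sensor 2, compound 2 missing
     5.0, 30.0, 200.0]     # sensor 3
print(f_avg(x, K=3))   # [3.0, 20.0, 200.0]
print(f_min(x, K=3))   # [1.0, 10.0, 100.0]
```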

Scenario 1

In this scenario, when a dimension \(x_{kj}\), k ∈ [K], j = 1, 2, 3, is missing, the operators \(f_{avg}\) and \(f_{\min }\) do not take that dimension into account in the calculation of the mean or the minimum, respectively; note that there is no MVA invocation in this scenario. We experiment with three approaches (we repeat the ODA and TDA for convenience):

  • The Optimal Delivery Approach (ODA), which observes \(M_{t}\) and delivers \(\mathbf {x}_{t}\) when \(M_{t} > y\). Then the System invokes the vectorial operators \(f_{avg}(\mathbf {x}_{t})\) (and \(f_{\min }(\mathbf {x}_{t})\)). Otherwise, the System takes the next observation, i.e., the next incoming context vector.

  • The Immediate Delivery Approach (IDA), which delivers context \(\mathbf {x}_{t}\) at each time instance t to the System. Then, the System invokes at each t the vectorial operators \(f_{avg}(\mathbf {x}_{t})\) (and \(f_{\min }(\mathbf {x}_{t})\)).

  • The Threshold-based Delivery Approach (TDA) with threshold parameter 𝜃 ∈ (0,n), which observes \(M_{t}\) and delivers \(\mathbf {x}_{t}\) when \(M_{t} > \theta \). Then the System invokes the vectorial operators \(f_{avg}(\mathbf {x}_{t})\) (and \(f_{\min }(\mathbf {x}_{t})\)). Otherwise, the System waits for the next time instance to process the incoming vector.

The comparative assessment in Scenario 1 examines whether the ODA yields accurate fusion results when compared with the delayed delivery of the TDA and the immediate (zero-delay) delivery of the IDA. Specifically, if \(\mathbf {x}^{\prime }_{t_{i}}\) is the context delivered to the System by an approach at some time instance \(t_{i}\) within the i-th epoch, i = 1,…,N, and \(\mathbf {x}_{t_{i}}\) is the ground truth (actual) context at that time instance (i.e., without missing values), then we define the mean fusion error for the \(f_{avg}(\cdot )\) operator as the root mean squared error of the fused vector, i.e.,

$$ e_{avg} = \left( \frac{1}{N} \sum\limits_{i=1}^{N} \parallel f_{avg}(\mathbf{x}_{t_{i}})-f_{avg}(\mathbf{x}^{\prime}_{t_{i}}) \parallel^{2} \right)^{1/2} $$
(7)
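The error in (7) can be computed as in the following sketch, which takes the fused ground-truth vectors and the fused delivered vectors of N epochs as inputs. The function name and the sample values are hypothetical.

```python
from math import sqrt


def fusion_error(fused_true, fused_delivered):
    """Root mean squared error over N epochs, following the form of (7).

    Each argument is a list of N fused vectors, e.g. three-dimensional
    outputs of a fusion operator, one per epoch.
    """
    N = len(fused_true)
    total = 0.0
    for truth, delivered in zip(fused_true, fused_delivered):
        # Squared Euclidean norm of the difference for this epoch.
        total += sum((a - b) ** 2 for a, b in zip(truth, delivered))
    return sqrt(total / N)


# Made-up fused vectors for N = 2 epochs.
truth = [[3.0, 20.0, 200.0], [4.0, 22.0, 210.0]]
delivered = [[3.0, 24.0, 197.0], [4.0, 22.0, 210.0]]
print(fusion_error(truth, delivered))
```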

The fusion error \(e_{\min }\) for the \(f_{\min }(\cdot )\) operator is similarly defined, and N is the total number of epochs for each approach. Moreover, we have to include the corresponding expected delay \(\omega _{avg}\) (and \(\omega _{\min }\)) of context delivery to the System by an approach (ODA, TDA, IDA) to obtain a certain fusion error. Evidently, the delay for the IDA is zero, since it immediately delivers context to the System for fusion. The expected delay for both ODA and TDA is defined as:

$$ \omega = \frac{1}{N}\sum\limits_{i=1}^{N}t^{*}_{i}, $$
(8)

where \(t^{*}_{i}\) refers (i) to the optimal stopping time T for the i-th epoch in the ODA, i.e., the first time instance at which \(M_{t_{i}} > y\) and (ii) to the threshold-based stopping time for the i-th epoch in the TDA, i.e., the first time instance at which \(M_{t_{i}} \geq \theta \) for a specific 𝜃.
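As an illustration of (8), the sketch below splits a stream of \(M_{t}\) values into epochs according to a delivery rule and averages the resulting stopping times. The stream, the value used for y, and the function names are hypothetical; in particular, y is a placeholder and is not computed from (4). The counter includes the delivered observation, so subtracting one gives the number of skipped observations, under which an immediate-delivery rule has zero delay.

```python
def stopping_times(M_stream, should_deliver):
    """Split a stream of M_t values into epochs and return each stopping time.

    should_deliver(M_t) is the delivery rule; the counter restarts after each
    delivery. A trailing epoch that never delivers is dropped.
    """
    times, t = [], 0
    for M_t in M_stream:
        t += 1
        if should_deliver(M_t):
            times.append(t)
            t = 0
    return times


def expected_delay(times):
    """Average of the stopping times, as in (8)."""
    return sum(times) / len(times)


M_stream = [30, 41, 44, 28, 45, 39, 47, 33, 36, 46]   # made-up counts, n = 48
y = 42.5      # placeholder for the ODA threshold
theta = 36    # an example TDA threshold

oda_times = stopping_times(M_stream, lambda M: M > y)
tda_times = stopping_times(M_stream, lambda M: M >= theta)
print(oda_times, expected_delay(oda_times))   # [3, 2, 2, 3] 2.5
print(tda_times, expected_delay(tda_times))
```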

Scenario 2

In this scenario, when a dimension \(x_{kj}\) is missing, its value is filled in (a.k.a. imputed) by the Exponential Smoothing MVA (ES-MVA) [30] with smoothing factor a ∈ (0, 1), which is used in time-series contextual data. Specifically, if the dimension \(x_{kj,t}\) at time instance t, which corresponds to sensor k ∈ [K] and to the chemical compound j ∈ {1, 2, 3}, is missing, then the ES-MVA replaces it with an estimate \(u_{kj,t}\) based on \(x_{kj,t-1}\) and the trajectory of this dimension up to t−1, that is:

$$ u_{kj,t} = a x_{kj,t-1} + (1-a) u_{kj,t-1}, $$
(9)

with \(u_{kj,1} = x_{kj,0}\). The smoothed statistical estimate \(u_{kj,t}\) for the corresponding missing value \(x_{kj,t}\) is a weighted average of the previous observation \(x_{kj,t-1}\) and the previous smoothed statistical estimate \(u_{kj,t-1}\). In this scenario, the three approaches ODA, TDA and IDA deliver the context \(\mathbf {x}_{t}\) to the System as described in Scenario 1. Nonetheless, upon receiving the vector \(\mathbf {x}_{t}\), the System first invokes the ES-MVA for imputation and then applies the fusion operators \(f_{avg}(\mathbf {x}_{t})\) and \(f_{\min }(\mathbf {x}_{t})\) to the imputed vector. Moreover, the fusion errors \(e_{avg}\) and \(e_{\min }\) in this scenario are defined as in Scenario 1, using the imputed contextual values.
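The recursion in (9) can be sketched for a single dimension as follows. The text does not state how a missing previous observation is handled; this sketch assumes that the imputed value is then used in its place, and it requires the initial value to be observed. The function name and the sample series are hypothetical.

```python
def es_impute(series, a):
    """Fill None entries of one dimension's time series using the recursion in (9).

    series[0] plays the role of x_{kj,0} and must be observed.
    Assumption of this sketch: when x_{kj,t-1} was itself missing, its imputed
    value is used in its place.
    """
    if series[0] is None:
        raise ValueError("the first value must be observed to initialise u")
    filled = [series[0]]
    u = series[0]                                   # u_1 = x_0
    for t in range(1, len(series)):
        if t > 1:
            u = a * filled[t - 1] + (1 - a) * u     # u_t = a x_{t-1} + (1 - a) u_{t-1}
        filled.append(u if series[t] is None else series[t])
    return filled


# Made-up series for one sensor and compound, with one missing reading.
print(es_impute([10.0, 12.0, None, 11.0], a=0.5))   # [10.0, 12.0, 11.0, 11.0]
```

A larger smoothing factor a gives more weight to the most recent observation, while a smaller one makes the estimate follow the longer-term trajectory of the dimension.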

5.2.2 Comparison evaluation

Tables 4 and 5 show the fusion errors \(e_{avg}\) and \(e_{\min }\) for the \(f_{avg}\) and \(f_{\min }\) operators, respectively, and the corresponding delay ω (shown within parentheses) with β ∈ {0.5,0.7} using the approaches ODA, IDA, and TDA repeated for \(N = 10^{4}\) epochs. The results are produced with observation cost c = 1; similar results are obtained with other c values. The ODA achieves lower error than IDA in all cases with a relatively small delay, i.e., number of observations until the mechanism delivers context to the System. This indicates the applicability of our approach to near real-time predictive analytics, as it achieves low fusion error compared with the IDA, which incurs 100% higher fusion error by immediately delivering context. Moreover, we experiment with different threshold values for the TDA, i.e., 𝜃 = η n, with different percentages η ∈ {0.1,…,0.9}. Evidently, the lower the threshold, the sooner the TDA delivers context to the System, since it stops at the first time instance at which the percentage of non-problematic values out of n is over η. As shown in Tables 4 and 5, TDA achieves higher fusion error than ODA with relatively higher delay. Specifically, with η ≤ 0.5, ODA outperforms TDA in both error and delay. On the other hand, for η > 0.5, i.e., when TDA stops only if more than 50 % of the contextual values are non-problematic, it achieves lower fusion error than ODA. However, this comes at the expense of a significantly higher delay (indicatively % for η = 0.7). This high delay is prohibitive for (near) real-time statistical analytics, especially in environmental monitoring, since significant events, e.g., fire or flood, cannot be detected at the early stages of a monitoring process. Evidently, as β increases all approaches obtain relatively lower fusion error, since fewer problematic pieces of context are observed. Nonetheless, in this case, TDA incurs extremely high delay for obtaining a low error. 
In both cases and for all β values, the proposed mechanism achieves low fusion error (for both types of fusion operators) with significantly low delay. The IDA never outperforms ODA in any case, while TDA for η > 0.5 attains lower fusion error with one or two orders of magnitude higher delay than that of ODA, thus rendering it inappropriate for real-time monitoring. It is worth noting that similar behavior is expected with other fusion operators that take into consideration the number of current contextual values, since the more non-problematic values we receive the better the accuracy of the event detection. For instance, fusion operators over the current context \(\mathbf {x}_{t}\) could be higher order statistics over the n current measurements, the top-K sources with respect to score functions over their measurements, the outliers of \(\mathbf {x}_{t}\) using the median absolute deviation about the median [31], or a weighted sum over the current contextual values.

Table 4 Scenario 1: Fusion error \(e_{avg}\) and delay ω (in parentheses)
Table 5 Scenario 1: Fusion error \(e_{\min }\) and delay ω (in parentheses)

When an MVA is adopted for imputing missing values before fusion, we obtain analogous performance for all mechanisms. Tables 6 and 7 show the impact of the adoption of the ES-MVA on the fusion errors for both fusion operators using all approaches. Obviously, by adopting an MVA, we obtain lower fusion errors since the missing values are replaced with plausible estimates, thus statistically reducing the error. Even in this case, ODA outperforms IDA significantly. This is because the ODA takes into account all information (i.e., the series \(M_{t}\)) before making an optimal decision on whether to stop at time t or continue and take the next observation. Recall that the highest possible expected context quality reward is obtained by the stopping rule stated in Theorem 2. This justifies the capability of our mechanism to deliver context of high quality with relatively low delay. The TDA attains low fusion error but with very high delay compared with the ODA and, obviously, the IDA. Overall, in both scenarios (with or without an MVA) the ODA is deemed an appropriate mechanism for near real-time analytics, assuring high quality of delivered context and thus improving the quality of MVAs while inducing a tolerable delay.

Table 6 Scenario 2: Fusion error \(e_{avg}\) and delay ω (in parentheses)
Table 7 Scenario 2: Fusion error \(e_{\min }\) and delay ω (in parentheses)

6 Conclusions

We introduce a quality-optimized mechanism for delaying context delivery to predictive analytics engines in the hope of receiving context of higher quality in data streams, thus eliminating possible biases in knowledge extraction and in decision making. The idea behind this mechanism is to avoid immediately delivering context by introducing a certain controlled delay. The proposed mechanism, based on the principles of optimal stopping theory, applies an optimal stopping rule for delivering context, taking into consideration the observation cost and the statistics of the quality indicators seen so far. An analytical stochastic optimization model is proposed and, through experimental evaluation and comparative assessment with a threshold-based and an immediate delivery approach, our mechanism is deemed appropriate for adoption, especially when the received context is (stochastically) of low quality and the observation cost is not negligible. Our future agenda includes the analysis and development of a mechanism in which the decision time for context delivery is contained within a finite, application-specific time interval.